
Efficient Continual Pre-training by Mitigating the Stability Gap

Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen

2024-06-25


Summary

This paper presents a new approach to continually pre-training large language models (LLMs) so they can adapt to new domains of knowledge, such as medical information. It focuses on improving performance efficiently while handling the challenges that arise when the training data distribution shifts.

What's the problem?

When LLMs are continually pre-trained on data from a new domain, they often suffer a temporary drop in performance at the start, known as the 'stability gap.' This happens because the model needs time to adjust to the new distribution of information, which wastes compute and slows adaptation to the new domain.

What's the solution?

To address this stability gap, the authors propose three strategies: (1) pre-training the model on a properly sized subset of the new-domain data for multiple epochs, rather than on the full corpus in a single pass, which helps it recover faster; (2) training only on high-quality data that quickly boosts domain performance; and (3) mixing in data similar to the original pre-training corpus to reduce the gap between the old and new training distributions. They tested these strategies on Llama-family models and found significant performance improvements, especially on medical tasks.
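The exact selection heuristics and mixing ratios belong to the paper; the snippet below is only a minimal sketch of the overall recipe, not the authors' code. The helpers `quality_score` and `train_step`, the subset size, the mix ratio, and the number of epochs are hypothetical placeholders standing in for the paper's actual choices.

```python
# Hedged sketch of the three data strategies; all helpers and ratios are
# illustrative assumptions, not the paper's released implementation.
import random

def build_continual_pretraining_data(domain_docs, general_docs,
                                     subset_size=100_000, mix_ratio=0.2,
                                     quality_score=None):
    """Build a fixed-size, high-quality domain subset mixed with
    general-domain text resembling the original pre-training data."""
    # Strategy (2): keep only the highest-quality domain documents.
    if quality_score is not None:
        domain_docs = sorted(domain_docs, key=quality_score, reverse=True)
    subset = domain_docs[:subset_size]          # Strategy (1): proper-sized subset

    # Strategy (3): mix in data similar to the original pre-training corpus.
    n_general = int(len(subset) * mix_ratio)
    mixture = subset + random.sample(general_docs, n_general)
    random.shuffle(mixture)
    return mixture

def continual_pretrain(model, train_step, data, num_epochs=4):
    """Strategy (1): repeat the small mixture for several epochs instead of
    making a single pass over the full domain corpus."""
    for epoch in range(num_epochs):
        random.shuffle(data)
        for example in data:
            train_step(model, example)  # user-supplied optimizer step; batching omitted
    return model
```

The key design choice is that the model repeatedly sees a small, high-quality, distribution-matched mixture rather than a large, noisy corpus once, which is what shortens the recovery from the stability gap.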

Why it matters?

This research is important because it provides effective methods for enhancing how LLMs learn from new information while avoiding common pitfalls like performance drops. The improved models, such as Llama-3-Physician, not only perform well on medical tasks but also compete with advanced models like GPT-4. This work could lead to better AI systems that are more efficient and capable of handling specialized knowledge.

Abstract

Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) Continually pre-training the LLM on a subset with a proper size for multiple epochs, resulting in faster performance recovery than pre-training the LLM on a large corpus in a single epoch; (2) Pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) Using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.
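Since the resulting checkpoint is publicly released at the Hugging Face link above, a minimal usage sketch with the transformers library is shown below; the prompt format and generation settings are assumptions rather than the authors' documented usage.

```python
# Hedged usage sketch: loading the released Llama-3-Physician checkpoint.
# The prompt format below is an assumption, not the authors' documented template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YiDuo1999/Llama-3-Physician-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map="auto" needs accelerate
)

prompt = "Question: What are common symptoms of iron-deficiency anemia?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```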