Learning Dynamics in Continual Pre-Training for Large Language Models

Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng

2025-05-13

Summary

This paper studies how large language models keep learning when they are trained on new data after their initial training, a process called continual pre-training, and it introduces a predictive rule, called a scaling law, that describes how their performance changes during this process.

What's the problem?

The problem is that when language models are trained on new data after their initial training, their performance doesn't always improve in a simple, predictable way. Loss curves can shift unexpectedly, especially when the new data differs a lot from what the model saw before, or when training settings such as the learning rate are changed.

What's the solution?

The researchers studied how these models learn during continual pre-training and derived a scaling law that explains and predicts the ups and downs in performance. Their formulation accounts for factors like the shift between the old and new data distributions and changes to the learning rate schedule.
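To give a feel for what "fitting a scaling law" means in practice, here is a minimal, purely illustrative sketch. It is not the paper's actual formula: it assumes a generic power-law loss curve L(t) = L_inf + A * t^(-alpha), generates synthetic "training loss" measurements, and recovers the parameters by a log-linear fit. All parameter values and the functional form are assumptions for illustration.

```python
import numpy as np

# Hypothetical illustration (NOT the paper's actual scaling law):
# assume loss follows L(t) = L_inf + A * t**(-alpha) during training.
rng = np.random.default_rng(0)

L_inf, A, alpha = 2.0, 1.5, 0.4            # assumed "true" parameters
t = np.arange(1, 201, dtype=float)          # training steps (arbitrary units)
loss = L_inf + A * t**(-alpha) + rng.normal(0, 1e-3, t.size)  # noisy measurements

# With the irreducible loss L_inf assumed known, the law is linear in log space:
#   log(L - L_inf) = log(A) - alpha * log(t)
slope, intercept = np.polyfit(np.log(t), np.log(loss - L_inf), 1)
alpha_hat, A_hat = -slope, np.exp(intercept)

print(f"fitted alpha = {alpha_hat:.3f}, A = {A_hat:.3f}")
```

Once fitted on early checkpoints, such a curve can extrapolate loss to later steps, which is the practical appeal of scaling laws; the paper's contribution is a law tailored to the continual setting, where distribution shift and learning rate changes bend these curves.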

Why it matters?

This matters because understanding how language models learn over time helps scientists and engineers make better decisions about how to train them, which leads to smarter, more reliable AI that can keep up with new information and changing needs.

Abstract

The study provides a scaling law for Continual Pre-Training (CPT) of large language models, characterizing the transition in performance curves and accounting for distribution shift and learning rate changes.