The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach
2025-02-03

Summary
This paper describes an unexpected connection between a branch of mathematics called convex optimization and the way large AI models are trained. The researchers found that the schedules used to adjust a model's learning speed during training closely match what this theory predicts should work best.
What's the problem?
Training large AI models is tricky, and one of the biggest challenges is deciding how fast the model should learn at different stages of training. This is called the learning-rate schedule. Until now, practitioners have mostly settled on schedules through trial and error, without a solid mathematical explanation for why certain schedules work better than others.
What's the solution?
The researchers discovered that a mathematical theory called non-smooth convex optimization predicts the learning-rate schedules that work best in practice. They focused on a schedule that keeps the learning rate constant for most of training and then decreases it linearly at the end, known as a 'linear cooldown' (a minimal sketch of such a schedule follows below). They proved a bound that explains mathematically why this cooldown helps. Using this insight, they improved the training of Llama-type language models by extending training runs with the optimal learning rate and by transferring the best learning rate found for one schedule to other schedules.
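As a rough illustration, here is a minimal Python sketch of such a constant-then-linear-cooldown schedule. The function name, the 20% cooldown fraction, the decay to exactly zero, and the example values are illustrative assumptions for this summary, not taken from the paper's code; a warmup phase, common in practice, is omitted for simplicity.

def lr_at_step(step, total_steps, base_lr, cooldown_frac=0.2):
    """Learning rate at `step` for a constant schedule with linear cooldown.

    The rate stays at base_lr for the first (1 - cooldown_frac) of training,
    then decreases linearly toward zero over the final cooldown_frac of steps.
    """
    cooldown_start = int((1.0 - cooldown_frac) * total_steps)
    if step < cooldown_start:
        return base_lr
    # Linear cooldown: interpolate from base_lr down to 0 over the remaining steps.
    cooldown_len = max(total_steps - cooldown_start, 1)
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / cooldown_len

# Example: 10,000 steps with a base learning rate of 3e-4 and a 20% cooldown.
schedule = [lr_at_step(t, 10_000, 3e-4) for t in range(10_000)]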
Why does it matter?
This matters because it provides a mathematical explanation for a practice that AI researchers have been following mostly on intuition and experience. The theory can now guide decisions about how to train AI models, potentially making them learn faster and perform better. As a demonstration, the researchers improved the training of large language models, the kind of AI used in chatbots and other advanced applications. This could lead to more efficient ways of building powerful AI systems in the future.
Abstract
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with optimal learning-rate, and (ii) transferring the optimal learning-rate across schedules.
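For orientation, the "performance bound from non-smooth convex optimization theory" referred to above has the following classical flavor. The display below is the textbook suboptimality bound for the subgradient method on a convex, G-Lipschitz function with step sizes \gamma_t and initial distance D to a minimizer; it is given here only as a reference point and is not the paper's exact bound for the constant-plus-cooldown schedule.

% Classical best-iterate bound for subgradient descent on a convex,
% G-Lipschitz function f, with \|x_1 - x^\star\| \le D and steps x_{t+1} = x_t - \gamma_t g_t.
\min_{1 \le t \le T} \, f(x_t) - f(x^\star)
  \;\le\;
  \frac{D^2 + G^2 \sum_{t=1}^{T} \gamma_t^2}{2 \sum_{t=1}^{T} \gamma_t}

The paper's contribution is an analogous bound for the constant schedule with linear cooldown in which, per the abstract, no logarithmic terms appear, mirroring the practical benefit of the cooldown phase.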