Fantastic Pretraining Optimizers and Where to Find Them
Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang
2025-09-03
Summary
This paper investigates why, despite claims of being faster, alternative optimizers haven't replaced AdamW as the standard for training large language models.
What's the problem?
Researchers have proposed optimizers claimed to train language models significantly faster than AdamW, sometimes by factors of 1.4 to 2x. However, these claims haven't translated into widespread adoption. The issue is that previous comparisons weren't fair: they didn't tune hyperparameters equally for each optimizer, and they often measured performance *during* training rather than at the end, when the model is fully trained. This can be misleading because some optimizers start strong but fall behind later.
What's the solution?
The researchers conducted a careful, comprehensive study, testing ten optimizers on language models of varying sizes (from small up to 1.2 billion parameters). They systematically tuned the hyperparameters of *each* optimizer to find its best configuration, and they evaluated performance only after the models had finished training. They also varied the amount of training data (from 1 to 8 times the Chinchilla-optimal ratio) to see how that affected the results.
Why does it matter?
The study found that the speedups of alternative optimizers are often overstated and shrink as model size grows. While some optimizers that use a technique called 'matrix preconditioning' are genuinely faster, their advantage drops to about 1.1x for the largest models tested. This work highlights the importance of rigorous, fair comparisons when evaluating new optimization methods, and it explains why AdamW remains a strong choice, especially for large language models.
Abstract
AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer a 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size, to only 1.1x for 1.2B-parameter models. Third, comparing intermediate checkpoints before reaching the target training budget can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all of the fastest optimizers, such as Muon and Soap, use matrices as preconditioners -- multiplying gradients by matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B-parameter models to merely 1.1x for 1.2B-parameter models.
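To make the distinction between entry-wise and matrix preconditioning concrete, here is a minimal NumPy sketch (not the paper's code) contrasting an AdamW-style step, which rescales each parameter by a per-entry scalar, with a Muon-style step, which multiplies the momentum matrix by a matrix via Newton-Schulz orthogonalization. Function names, learning rates, and the momentum constant are illustrative choices, not values from the paper; bias correction is omitted for brevity.

```python
import numpy as np

def adamw_update(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW-style step (bias correction omitted): each parameter is
    rescaled by an entry-wise scalar derived from a per-entry
    second-moment estimate."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    w = w - lr * (m / (np.sqrt(v) + eps) + wd * w)  # entry-wise preconditioning
    return w, m, v

def orthogonalize(g, steps=5):
    """Newton-Schulz iteration of the kind used by Muon-style optimizers:
    pushes the singular values of the update matrix toward 1, i.e. a
    matrix-valued preconditioner rather than entry-wise scalars.
    Assumes a wide matrix (rows <= cols); quintic coefficients as in
    common Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize by Frobenius norm
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x

def muon_update(w, g, m, lr=0.02, beta=0.95):
    """Muon-style step: accumulate momentum, then replace the update
    matrix with its (near-)orthogonalized version."""
    m = beta * m + g
    w = w - lr * orthogonalize(m)
    return w, m
```

The key contrast is the last line of each update: AdamW divides entry by entry, while the Muon-style step applies matrix products that mix information across rows and columns of the gradient.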