Benchmarking Optimizers for Large Language Model Pretraining

Andrei Semenov, Matteo Pagliardini, Martin Jaggi

2025-09-03

Summary

This paper investigates different methods for training large language models, like the ones powering chatbots, more efficiently.

What's the problem?

Recently, many new techniques have been proposed to speed up and improve the training of these large language models, with claims of faster learning and less need for careful manual adjustment. However, because each technique was tested under a different experimental setup, it has been hard to tell which ones *actually* work best and under what conditions. It's like everyone running a different race with different rules: you can't compare the winners.

What's the solution?

The researchers performed a large set of experiments, carefully testing several of these new training methods. They tested them all under the same standardized setup, systematically varying things like the size of the model, how much data was processed at once (the batch size), and how long the training lasted. They carefully tuned each method's hyperparameters to get its best possible performance, then compared the results to see which optimizers worked best in each situation. They also made their code publicly available so others can verify and build on their work.
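The evaluation protocol described above can be sketched as a tiny, self-contained toy benchmark. This is *not* the paper's actual code: the optimizers here are hand-rolled SGD and Adam on a toy quadratic loss, and all function names, the grids, and the problem itself are illustrative assumptions. The key idea it demonstrates is the paper's fairness principle: every optimizer is tuned over its own learning-rate grid before any comparison is made, under an otherwise identical setup.

```python
import math
import random

def make_problem(dim, seed=0):
    """Build a toy quadratic loss ||w - target||^2 standing in for LLM pretraining loss."""
    rng = random.Random(seed)
    target = [rng.uniform(-1, 1) for _ in range(dim)]
    def loss_and_grad(w):
        diffs = [wi - ti for wi, ti in zip(w, target)]
        loss = sum(d * d for d in diffs)
        grad = [2 * d for d in diffs]
        return loss, grad
    return loss_and_grad

def sgd_step(w, grad, state, lr):
    """Plain gradient descent: w <- w - lr * grad."""
    return [wi - lr * gi for wi, gi in zip(w, grad)], state

def adam_step(w, grad, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam update with bias correction."""
    m, v, t = state
    t += 1
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, (m, v, t)

def run(optimizer, lr, steps, dim=10):
    """Train from the same initialization for a fixed budget; return final loss."""
    loss_and_grad = make_problem(dim)
    w = [0.0] * dim
    if optimizer == "adam":
        state = ([0.0] * dim, [0.0] * dim, 0)
        step = lambda w, g, s: adam_step(w, g, s, lr)
    else:
        state = None
        step = lambda w, g, s: sgd_step(w, g, s, lr)
    for _ in range(steps):
        _, grad = loss_and_grad(w)
        w, state = step(w, grad, state)
    return loss_and_grad(w)[0]

def benchmark(steps_grid=(50, 200), lr_grid=(0.3, 0.1, 0.03, 0.01)):
    """For each training budget, tune each optimizer over its own lr grid,
    then record the best result -- so no method is handicapped by a bad setting."""
    results = {}
    for steps in steps_grid:
        for opt in ("sgd", "adam"):
            results[(opt, steps)] = min(run(opt, lr, steps) for lr in lr_grid)
    return results

if __name__ == "__main__":
    for (opt, steps), loss in sorted(benchmark().items()):
        print(f"{opt:5s} steps={steps:4d} best final loss={loss:.6f}")
```

The same pattern scales conceptually to the paper's setting: swap the toy loss for a transformer's training loss, add model size and batch size to the sweep, and tune each optimizer's full hyperparameter set rather than only the learning rate.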

Why it matters?

This research matters because it gives practitioners who actually train large language models clear, evidence-based guidance on choosing an optimizer for their specific model size, batch size, and training budget. It also points researchers toward promising directions for future improvements in training techniques, and the open-source release encourages more rigorous and directly comparable research in the field.

Abstract

The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across standardized LLM pretraining scenarios, systematically varying model size, batch size, and training duration. Through careful tuning of each method, we provide guidance to practitioners on which optimizer is best suited for each scenario. For researchers, our work highlights promising directions for future optimization research. Finally, by releasing our code and making all experiments fully reproducible, we hope our efforts can help the development and rigorous benchmarking of future methods.