Optimal Scaling Needs Optimal Norm
Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim
2025-10-07

Summary
This research investigates how to best adjust two key training settings, the learning rate and batch size, when training large language models (LLMs) of different sizes on different amounts of data.
What's the problem?
Currently, there is no single, clear rule for how to optimally scale the learning rate and batch size as the model size and the amount of training data grow. Researchers have made progress, but haven't found a unifying principle that always works, which makes it difficult to train these massive models efficiently.
What's the solution?
The researchers discovered a key pattern using an optimizer called Scion: the best learning rate and batch size combination consistently produces the same value of the 'operator norm' of the network's final layer. This constant norm acts as a guide: if you maintain it while scaling up the model and data, you're likely to get good results. They also measured how the optimal learning rate and batch size shift as the dataset grows, and found that this scaling matches what is already known for the popular Adam optimizer. Finally, they found that using different learning rates for different parts of the model, in particular lower ones for the hidden layers, further improves performance.
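To make the constant-norm guide concrete, here is a minimal sketch of how one might track the output layer's operator norm during training. It is illustrative only: it assumes the spectral norm (the largest singular value, i.e. the l2-to-l2 operator norm) as the norm of interest and a hypothetical `model.lm_head` attribute; the paper's exact norm definition for Scion's output layer may differ.

```python
import torch

def output_layer_operator_norm(weight: torch.Tensor) -> float:
    """Largest singular value of the output-layer weight matrix.

    The spectral norm is used here as an illustrative choice of
    operator norm; the paper may measure a different norm for the
    output layer.
    """
    # torch.linalg.matrix_norm with ord=2 returns the spectral norm.
    return torch.linalg.matrix_norm(weight.float(), ord=2).item()

# Hypothetical usage inside a training loop: log the norm so that
# runs with different (eta, B) pairs can be compared against the
# norm reached by the optimal pair.
# norm = output_layer_operator_norm(model.lm_head.weight)
# logger.log({"output_operator_norm": norm})
```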
Why it matters?
This work is important because it provides a practical rule of thumb for efficiently training very large language models: maintain a constant operator norm of the output layer. This can save significant time and resources, since finding the right training settings is often a trial-and-error process. The researchers also released their training setup, along with logs from over two thousand runs, so that other researchers can build on their findings and further explore LLM training.
Abstract
Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair (η*, B*) consistently has the same operator norm value, a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple (η, B) pairs reach the optimal norm, only a unique (η*, B*) achieves the best loss. As a sufficient condition, we provide the first measurement of (η*, B*) scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
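The per-layer-group finding can be expressed with standard PyTorch optimizer parameter groups. The sketch below is an assumption-laden illustration, not the paper's configuration: the name-based grouping heuristic, the 0.5x factor for hidden layers, and the use of AdamW in place of the authors' Distributed Scion (Disco) optimizer are all placeholders.

```python
import torch
from torch import nn

def build_param_groups(model: nn.Module, base_lr: float):
    """Split parameters into embedding / hidden / output groups.

    Mirrors the paper's qualitative finding: hidden layers get a
    lower learning rate, and the output layer is the group whose
    learning rate matters most. The 0.5x factor is a placeholder,
    not a value from the paper.
    """
    embed, hidden, output = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:        # input embedding parameters
            embed.append(p)
        elif "lm_head" in name:    # output (unembedding) layer
            output.append(p)
        else:                      # hidden transformer blocks
            hidden.append(p)
    return [
        {"params": embed, "lr": base_lr},
        {"params": hidden, "lr": 0.5 * base_lr},  # lower LR for hidden layers
        {"params": output, "lr": base_lr},        # most sensitive group to tune
    ]

# AdamW stands in here for the released Disco optimizer.
# optimizer = torch.optim.AdamW(build_param_groups(model, base_lr=3e-4))
```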