Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
2025-10-20
Summary
This paper tackles the challenge of efficiently training very large neural networks, specifically focusing on how to adjust settings like learning rate and weight decay when you change the size (width) of the network.
What's the problem?
When you scale up a neural network by making it wider, the standard ways of setting the learning rate and weight decay don't always work well. Because of how normalization layers and the AdamW optimizer interact, the effective learning rate drifts as the network's width changes. This breaks a technique called 'maximal-update parameterization' (muP), which normally allows you to transfer settings easily between different network sizes.
What's the solution?
The researchers discovered a relationship between the width of the network, the learning rate, and the weight decay. To keep the network stable and learning effectively as you widen it, the weight decay for the matrix-like parts of the network should scale proportionally to the width, while their learning rate scales inversely with width (the usual muP rule); the vector-like parts keep a constant learning rate and no weight decay. This allows you to train networks of different widths without manually tuning the learning rate and weight decay for each size.
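The transfer rule described above can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name, argument names, and the example values are all hypothetical, and the scalings (learning rate inversely with width and weight decay proportionally to width for matrix-like parameters, constant learning rate and zero weight decay for vector-like parameters) are taken from the abstract.

```python
def transfer_hyperparams(proxy_width, target_width,
                         proxy_lr_matrix, proxy_wd_matrix, proxy_lr_vector):
    """Hypothetical sketch of zero-shot hyperparameter transfer across widths.

    - matrix-like parameters: lr scales as 1/d (muP), weight decay scales as d
    - vector-like parameters: lr stays constant in d, weight decay stays 0
    """
    s = target_width / proxy_width  # width ratio d_target / d_proxy
    return {
        "matrix": {"lr": proxy_lr_matrix / s, "weight_decay": proxy_wd_matrix * s},
        "vector": {"lr": proxy_lr_vector, "weight_decay": 0.0},
    }

# Example: tune at a small proxy width 256, transfer to a 16x wider target.
hp = transfer_hyperparams(256, 4096,
                          proxy_lr_matrix=1e-3, proxy_wd_matrix=0.1,
                          proxy_lr_vector=1e-3)
```

In practice the two groups would map onto optimizer parameter groups (e.g. AdamW with per-group `lr` and `weight_decay`), with embeddings, biases, and normalization gains treated as vector-like.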
Why it matters?
This work is important because it makes training large neural networks much more practical. Instead of spending a lot of time figuring out the best learning rate and weight decay for each network size, you can use a simple rule to transfer settings from a smaller 'proxy' network to a larger 'target' network. This saves a lot of computational resources and time, and it helps researchers build even more powerful AI models.
Abstract
Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization (muP) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading muP transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as η/λ with an approximately invariant shape; under width scaling d, we observe that the top singular value scales approximately as (η/λ) · d^{0.75}. Combining this observation with the muP learning-rate rule η_2 ∝ d^{-1} for matrix-like parameters implies an empirical weight-decay scaling rule λ_2 ∝ d that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at η_1 = Θ_d(1) and λ_1 = 0, this yields zero-shot transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic (matching top singular values) to check sublayer-gain invariance. Our results extend muP beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
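The diagnostic mentioned in the abstract (matching top singular values) can be sketched as follows. This is a hypothetical illustration of the idea, not the authors' implementation: it compares each matrix's top singular value against the empirical prediction (η/λ) · d^{0.75}, and the ratio across two widths should stay near 1 when sublayer gain is width invariant. All function and argument names here are assumptions.

```python
import numpy as np

def top_singular_value(W):
    # Largest singular value (spectral norm); singular values come back
    # in descending order from numpy's SVD.
    return np.linalg.svd(W, compute_uv=False)[0]

def sublayer_gain_ratio(W_proxy, d_proxy, eta_p, lam_p,
                        W_target, d_target, eta_t, lam_t):
    """Hypothetical diagnostic: normalize each width's top singular value
    by the predicted steady-state scale (eta/lambda) * d**0.75 and compare.
    A ratio near 1.0 suggests sublayer gain is width invariant."""
    pred_proxy = (eta_p / lam_p) * d_proxy ** 0.75
    pred_target = (eta_t / lam_t) * d_target ** 0.75
    return ((top_singular_value(W_target) / pred_target)
            / (top_singular_value(W_proxy) / pred_proxy))
```

Running this on checkpoints at a proxy width and a target width would give a cheap check that the weight-decay scaling rule is holding, without retracing training dynamics.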