Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
2026-01-09
Summary
This paper examines a standard practice in training large language models, applying weight decay to the weight matrices, and improves on it by letting the model learn the right scale for those matrices on its own.
What's the problem?
When training these models, a balance emerges between the weights growing under noisy gradient updates (which helps the model learn) and weight decay pulling them back toward zero (which helps it generalize). Prior work showed that this balance settles into a predictable weight norm, but this paper argues that the equilibrium norm isn't actually the *best* scale for the weights; it is a side effect of how the model is trained.
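To make the equilibrium idea concrete, here is a small toy simulation (our illustration, not from the paper) in which a weight matrix receives pure gradient noise each step while decoupled weight decay shrinks it; the Frobenius norm grows at first and then flattens out at a level set by the learning rate, the noise scale, and the weight-decay strength.

```python
# Toy illustration (not from the paper): gradient noise makes the weight
# norm grow like a random walk, while weight decay shrinks it each step.
# The two effects balance, and the Frobenius norm settles at an equilibrium.
import numpy as np

rng = np.random.default_rng(0)
d, lr, wd, noise_std, steps = 64, 1e-2, 0.1, 1.0, 10_000

W = rng.normal(scale=0.01, size=(d, d))   # start with a small-norm matrix
for t in range(steps):
    noise = rng.normal(scale=noise_std, size=W.shape)  # stand-in for gradient noise
    W = (1.0 - lr * wd) * W - lr * noise               # decoupled weight decay + noisy step
    if (t + 1) % 2000 == 0:
        print(f"step {t + 1:5d}   ||W||_F = {np.linalg.norm(W):.2f}")
```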
What's the solution?
The researchers address this by attaching a 'volume knob' to each weight matrix, letting the model learn the optimal scale on its own. They start with a single knob per matrix, then go finer-grained with individual knobs for each row and each column. This is more expressive than fixed muP multipliers and reduces the manual tuning those multipliers require.
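As a rough sketch of what these knobs could look like in code (a hypothetical PyTorch module, not the authors' implementation), a linear layer can carry a learnable scalar plus learnable per-row and per-column vectors that rescale the weight matrix, while weight decay keeps acting on the matrix itself:

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):
    """Hypothetical sketch of a linear layer with learnable scale 'knobs'.

    The weight matrix W keeps its usual role (and, presumably, its weight
    decay), while a learnable scalar s and per-row / per-column vectors r, c
    set the effective scale W_eff = s * diag(r) @ W @ diag(c).
    """

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.scalar = nn.Parameter(torch.ones(()))   # one knob for the whole matrix
        self.row = nn.Parameter(torch.ones(d_out))   # one knob per output row
        self.col = nn.Parameter(torch.ones(d_in))    # one knob per input column

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_eff = self.scalar * self.row[:, None] * self.weight * self.col[None, :]
        return x @ w_eff.T
```

In an actual training setup one would likely place the multipliers in a separate optimizer parameter group (for example with weight_decay=0) so that weight decay constrains W but not the learned scales; the paper's exact treatment of the multipliers may differ.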
Why does it matter?
This work matters because it can yield better-performing large language models for the same amount of computation. The gains are comparable to switching to a more advanced optimizer such as Muon, yet come from a simpler change to the training procedure. It also offers insight into how the model organizes its internal computations and how the scale of its weights relates to its performance.
Abstract
Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they improve downstream evaluations by a margin matching the gain from switching from Adam to Muon.
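In notation (ours, not necessarily the paper's), one natural way to write the parameterization and its relation to fixed muP multipliers is:

```latex
% Illustrative notation, not necessarily the paper's.
% Learnable scalar s and learnable per-row / per-column vectors r, c:
\[
  W_{\mathrm{eff}} = s \,\operatorname{diag}(r)\, W \,\operatorname{diag}(c),
  \qquad y = W_{\mathrm{eff}}\, x .
\]
% A muP multiplier is the special case of a fixed, non-learned scalar
% chosen by width-scaling rules:
\[
  W_{\mathrm{eff}}^{\text{muP}} = \alpha\, W, \qquad \alpha \ \text{fixed per layer}.
\]
```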