On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie
2026-02-18
Summary
This paper investigates a new way to train large language models, challenging the common practice of relying on complex optimization methods. It proposes a surprisingly simple technique, randomly blocking some of the updates to the model's parameters during training, and shows that this can outperform current state-of-the-art optimizers.
What's the problem?
Training large language models is computationally expensive and relies on sophisticated adaptive optimizers that try to adjust the model's parameters efficiently. These optimizers keep getting more complex, yet it's not clear whether that added complexity is necessary or even helpful, and the existing methods can still settle on suboptimal parameter values, leading to weaker models.
What's the solution?
The researchers found that randomly 'masking' or blocking some of the parameter updates during training, even with a simpler optimizer like RMSProp, can be very effective. Their analysis shows that this masking acts as a smoothing, curvature-dependent form of regularization, making training more stable. Building on this, they developed a method called Magma, which decides *which* updates to mask based on how well the model's momentum aligns with the current gradient, improving results further. Magma can be swapped in for existing optimizers without major changes to the training process and with negligible extra cost.
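To make the random-masking idea concrete, here is a minimal PyTorch-style sketch (not the authors' code) of an RMSProp step in which a random fraction of the per-parameter updates is zeroed out before being applied. The function name `masked_rmsprop_step` and the `mask_prob` parameter are illustrative assumptions, not names from the paper.

```python
import torch

def masked_rmsprop_step(param, grad, sq_avg, lr=1e-3, alpha=0.99,
                        eps=1e-8, mask_prob=0.5):
    """RMSProp-style update where a random subset of coordinates is masked.

    All tensor arguments are modified in place; `sq_avg` holds the running
    second-moment estimate. Names and defaults are illustrative assumptions.
    """
    # Standard RMSProp second-moment accumulator.
    sq_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
    update = grad / (sq_avg.sqrt() + eps)

    # Keep each coordinate's update with probability (1 - mask_prob);
    # masked coordinates are simply left untouched this step.
    keep = (torch.rand_like(update) > mask_prob).to(update.dtype)
    param.add_(keep * update, alpha=-lr)


# Toy usage: one masked step on a random parameter tensor.
p = torch.randn(4, 4)
g = torch.randn(4, 4)
state = torch.zeros_like(p)
masked_rmsprop_step(p, g, state)
```

The masking is applied per coordinate, much like dropout applied to the update rather than to activations, which is what lets it act as a regularizer on the optimization trajectory rather than on the model's predictions.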
Why it matters?
This research is important because it suggests we might not need increasingly complex optimizers to train powerful language models. Magma offers a simpler and potentially more effective alternative that adds negligible computational overhead while improving model quality. The significant improvements in perplexity, a measure of how well the model predicts text, demonstrate the potential of this approach for building better AI systems.
Abstract
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
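For a concrete feel for the alignment idea mentioned in the abstract, here is a minimal, hypothetical Python sketch of how momentum-gradient alignment could modulate which updates get masked. The function name `alignment_masked_step`, the sign-agreement rule, and the `keep_frac` random keep rate are assumptions for illustration; the paper's actual Magma rule may differ.

```python
import torch

def alignment_masked_step(param, grad, momentum, sq_avg, lr=1e-3,
                          beta=0.9, alpha=0.99, eps=1e-8, keep_frac=0.1):
    """RMSProp-style step gated by momentum-gradient alignment (a sketch).

    All tensor arguments are modified in place. The gating rule below is an
    assumed, simplified stand-in for the paper's alignment-based masking.
    """
    # Momentum and second-moment accumulators.
    momentum.mul_(beta).add_(grad, alpha=1 - beta)
    sq_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
    update = grad / (sq_avg.sqrt() + eps)

    # Assumed gating rule: keep coordinates where momentum and gradient
    # agree in sign, plus a small random fraction of the remaining ones.
    aligned = (torch.sign(momentum) == torch.sign(grad)).to(update.dtype)
    random_keep = (torch.rand_like(update) < keep_frac).to(update.dtype)
    keep = torch.maximum(aligned, random_keep)

    param.add_(keep * update, alpha=-lr)
```

The intent of this sketch is only to show where an alignment signal could replace the purely random mask: coordinates whose momentum and gradient disagree are the ones most likely to be suppressed.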