NorMuon: Making Muon more efficient and scalable

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, Tuo Zhao

2025-10-09

Summary

This paper introduces a new optimizer called NorMuon for training large language models, aiming to improve upon existing methods like Adam and Muon by combining their strengths.

What's the problem?

Training really big language models is computationally expensive and slow, and the choice of optimizer largely determines how many steps that training takes. The standard choice, Adam, adapts a learning rate for each parameter individually but does nothing about the overall geometry of the update. A newer optimizer, Muon, orthogonalizes each weight matrix's update, which improves conditioning and makes training more efficient. However, the resulting updates can make some neurons far more influential than others, creating an imbalance. In short, Muon fixes the conditioning problem but doesn't ensure that all parts of the model are updated evenly.

What's the solution?

The researchers created NorMuon, which builds on Muon by adding a way to balance the influence of each neuron during training. It does this by tracking how each neuron contributes to the updates and then normalizing those contributions. This ensures that all neurons are used effectively while still benefiting from Muon’s improved stability. They also figured out how to make NorMuon work efficiently on multiple computers at the same time, which is crucial for training these massive models.
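The update rule described above can be sketched roughly as follows, based on what the abstract states: a momentum buffer is orthogonalized (as in Muon, typically via a Newton-Schulz iteration), then a per-neuron (row-wise) second-moment estimate is used to normalize each row of the orthogonalized update. This is an illustrative sketch, not the paper's implementation; the function names, the quintic coefficients (taken from the public Muon reference code), and the hyperparameters are assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D update matrix via a quintic
    Newton-Schulz iteration, in the style of Muon. Coefficients are
    those used in the public Muon reference code (an assumption here)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)   # normalize so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                       # work with the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x

def normuon_update(grad, momentum, second_moment,
                   beta1=0.95, beta2=0.999, eps=1e-8):
    """One NorMuon-style step for a single 2-D weight (sketch).

    `momentum` has the weight's shape; `second_moment` holds one scalar
    per output neuron (one per row), which is what keeps the extra
    memory cost small. Returns (update, momentum, second_moment)."""
    momentum = beta1 * momentum + grad
    o = newton_schulz_orthogonalize(momentum)
    # Per-neuron statistic: mean squared entry of each row of the
    # orthogonalized update (the exact statistic is an assumption).
    row_sq = np.mean(o ** 2, axis=1)
    second_moment = beta2 * second_moment + (1 - beta2) * row_sq
    # Row-wise normalization: balance each neuron's update magnitude.
    update = o / (np.sqrt(second_moment)[:, None] + eps)
    return update, momentum, second_moment
```

The key design point is that the second-moment buffer is one scalar per row rather than one per parameter, so the optimizer's memory footprint stays close to Muon's while still equalizing per-neuron update magnitudes.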

Why it matters?

NorMuon is important because it consistently trains large language models faster and more efficiently than both Adam and Muon. It shows that combining different optimization techniques – specifically, making updates stable *and* ensuring all parts of the model are used – is a promising direction for improving how we train these powerful AI systems. This could lead to faster development and deployment of even more advanced language models.

Abstract

The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon's emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and 11.31% improvement over Muon in the 1.1B pretraining setting, while maintaining a comparable memory footprint to Muon. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.