MARS-M: When Variance Reduction Meets Matrices

Yifeng Liu, Angela Yuan, Quanquan Gu

2025-10-28

Summary

This paper introduces a new optimization algorithm called MARS-M, designed to speed up the training of large neural networks, especially large language models.

What's the problem?

Training really big neural networks, like those used for language models, is slow and computationally expensive. Existing optimizers either focus on efficiently handling the massive calculations needed (like Muon) or on reducing the variability in the training process to speed things up (like MARS). However, no single method combines both of these advantages.

What's the solution?

The researchers combined the strengths of Muon and MARS into a single optimizer, MARS-M. They proved mathematically that MARS-M converges to a good solution faster than Muon alone. They also tested it on language modeling and image recognition tasks, showing that it consistently achieves lower losses and better downstream performance.
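To make the combination concrete, here is a minimal sketch of what one MARS-M-style update on a weight matrix might look like: a MARS-style variance-reduced, clipped momentum step followed by a Muon-style Newton-Schulz orthogonalization. The function names, hyperparameter defaults, and exact clipping rule are assumptions for illustration, not the paper's reference implementation (which is linked in the abstract).

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration popularized by Muon (coefficients from Muon's reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # scale so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                        # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def mars_m_step(W, g, g_prev, m, lr=0.02, beta=0.95, gamma=0.025):
    """One hypothetical MARS-M update (illustrative sketch only).

    g, g_prev: current and previous stochastic gradients of W.
    m: momentum buffer (same shape as W).
    """
    # MARS-style variance reduction: correct the raw gradient with a
    # scaled difference of consecutive gradients.
    c_t = g + gamma * (beta / (1.0 - beta)) * (g - g_prev)
    # Clip the corrected gradient to unit Frobenius norm (as in MARS).
    norm = np.linalg.norm(c_t)
    if norm > 1.0:
        c_t = c_t / norm
    # Momentum on the variance-reduced gradient.
    m = beta * m + (1.0 - beta) * c_t
    # Muon-style step: orthogonalize the momentum matrix before updating.
    W = W - lr * newton_schulz(m)
    return W, m
```

The key structural point is the ordering: variance reduction and clipping act on the raw gradients first, and the matrix preconditioning (orthogonalization) is applied last, to the resulting momentum matrix.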

Why it matters?

This work is important because it offers a faster and more efficient way to train the increasingly large and complex neural networks that power many modern AI applications. By reducing training time, researchers and developers can iterate more quickly and build even more powerful AI systems.

Abstract

Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of O(T^{-1/3}), which improves upon the O(T^{-1/4}) rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/MARS_M.
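The rate comparison in the abstract can be written out as displayed math. The precise stationarity measure (expected gradient norm after T steps) is an assumption here; the abstract only states the rates themselves:

```latex
% Rate of convergence to a first-order stationary point after T steps,
% as stated in the abstract (stationarity measure assumed for illustration):
\min_{t \le T} \mathbb{E}\,\big\|\nabla f(W_t)\big\| =
\begin{cases}
  \mathcal{O}\!\left(T^{-1/4}\right) & \text{Muon} \\[2pt]
  \mathcal{O}\!\left(T^{-1/3}\right) & \text{MARS-M}
\end{cases}
```

Since T^{-1/3} shrinks faster than T^{-1/4}, MARS-M needs fewer iterations to reach the same stationarity level, which is the theoretical basis for the claimed speedup.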