MARS: Unleashing the Power of Variance Reduction for Training Large Models
Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu
2024-11-18

Summary
This paper introduces MARS, a new optimization framework designed to improve the training of large neural networks by reducing the variance of the stochastic gradients used during training.
What's the problem?
Training deep learning models, especially large ones, requires efficient methods for updating their parameters. Adaptive algorithms like Adam and AdamW are widely used, but they do not explicitly reduce the variance of stochastic gradients, and that noise can slow training and hurt final performance. Although many variance-reduction techniques have been developed over the past decade, they have not been successfully applied to training large models.
What's the solution?
The authors propose MARS (Make vAriance Reduction Shine), a framework that combines variance reduction with preconditioned gradient methods. MARS uses a scaled stochastic recursive momentum technique to build lower-variance gradient estimates that guide the parameter updates. They introduce three instances of MARS built on popular optimizers: AdamW, Lion, and Shampoo. Experimental results show that MARS outperforms AdamW in both training speed and accuracy when training GPT-2 models; a minimal illustrative sketch of such an update is given below.
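To make the idea concrete, here is a minimal NumPy sketch of how a variance-reduced correction term can be folded into an AdamW-style update. The function name `mars_like_step`, the hyperparameter values, and the omission of the paper's gradient clipping are illustrative assumptions, not the authors' reference implementation.

```python
# Illustrative sketch (not the official MARS code): a STORM-style variance-reduced
# correction combined with an AdamW-style preconditioned update.
import numpy as np

def mars_like_step(param, grad, prev_grad, state,
                   lr=3e-4, beta1=0.9, beta2=0.95, gamma=0.025,
                   eps=1e-8, weight_decay=0.1):
    """One illustrative update. `grad` and `prev_grad` are stochastic gradients
    evaluated at the current and previous iterates on the SAME mini-batch,
    which is what makes the correction term cancel gradient noise."""
    # Variance-reduced gradient estimate: current gradient plus a scaled
    # correction built from the gradient difference.
    c = grad + gamma * (beta1 / (1.0 - beta1)) * (grad - prev_grad)

    # First and second moments with bias correction, as in Adam/AdamW.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * c
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * c**2
    t = state["t"] = state["t"] + 1
    m_hat = state["m"] / (1.0 - beta1**t)
    v_hat = state["v"] / (1.0 - beta2**t)

    # Decoupled weight decay (the "W" in AdamW), then the preconditioned step.
    param = param - lr * weight_decay * param
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), state

# Toy usage on f(w) = ||w||^2 (gradient 2w), with zero previous gradient.
w = np.random.randn(10)
state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}
w, state = mars_like_step(w, 2.0 * w, np.zeros_like(w), state)
```

Swapping the AdamW-style second-moment preconditioner for a sign-based (Lion-like) or matrix (Shampoo-like) preconditioner would give the flavor of the other two instances described in the paper.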
Why it matters?
This research is important because it provides a new way to train large models more efficiently, which can lead to faster development of AI technologies. By improving how we optimize these models, MARS could help researchers and developers create better-performing AI systems that can handle complex tasks more effectively.
Abstract
Training deep neural networks--and more recently, large models--demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
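As background on the "scaled stochastic recursive momentum" mentioned in the abstract, the display below sketches a STORM-style variance-reduced estimator and a scaled variant. The exact scaling, clipping, and preconditioning used by MARS are defined in the paper, so this should be read as a sketch of the underlying technique rather than the precise MARS update.

```latex
% STORM-style recursive momentum (background sketch, not the exact MARS update).
% Reusing the same mini-batch \xi_t at two consecutive iterates lets the
% correction term cancel gradient noise:
\[
  m_t = \nabla f(x_t;\xi_t) + (1-a)\bigl(m_{t-1} - \nabla f(x_{t-1};\xi_t)\bigr),
  \qquad a \in (0,1].
\]
% A scaled variant folds the correction into a standard momentum buffer:
\[
  c_t = \nabla f(x_t;\xi_t)
        + \gamma\,\tfrac{\beta}{1-\beta}\bigl(\nabla f(x_t;\xi_t) - \nabla f(x_{t-1};\xi_t)\bigr),
  \qquad
  m_t = \beta\, m_{t-1} + (1-\beta)\, c_t .
\]
```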