Practical Efficiency of Muon for Pretraining
Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski
2025-05-06
Summary
This paper studies Muon, an optimizer for training AI models that makes pretraining faster and uses less data and computing power than the standard method, AdamW.
What's the problem?
Training big AI models takes a lot of time, computing power, and careful tuning of settings, which makes it expensive and slow; the standard training method also gets less out of each piece of data when models are trained on very large batches at once.
What's the solution?
The researchers show that Muon, a smarter optimizer, helps AI models keep learning efficiently even when training on very big batches of data, and that it works well with another technique called muP, which makes tuning settings easier and cheaper (the core idea behind Muon's update is sketched below).
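For readers curious what the optimizer actually does: Muon, per Jordan et al.'s public reference implementation (which this paper evaluates), applies momentum SGD to each 2-D weight matrix but orthogonalizes the update with a quintic Newton-Schulz iteration before applying it. Below is a minimal NumPy sketch; the function names and the learning-rate and momentum defaults are illustrative assumptions, not values from this paper.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration mapping the momentum matrix to an
    # approximately semi-orthogonal matrix (all singular values near 1).
    # Coefficients follow the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = m / (np.linalg.norm(m) + eps)   # Frobenius-normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(w, grad, buf, lr=0.02, beta=0.95):
    # Heavy-ball momentum, then orthogonalize the update before applying it.
    # (The reference implementation uses Nesterov momentum; plain momentum
    # is shown here for brevity.)
    buf = beta * buf + grad
    w = w - lr * newton_schulz_orthogonalize(buf)
    return w, buf

# Example: one update on a random 512x512 weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))
buf = np.zeros_like(w)
w, buf = muon_step(w, rng.normal(size=w.shape), buf)
```

Orthogonalizing the update equalizes its singular values, which is what gives Muon its second-order flavor without the cost of forming or inverting a curvature matrix.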
Why it matters?
This matters because AI models can be trained faster and more cheaply, making advanced technology more accessible to researchers, companies, and anyone who wants to build intelligent systems.
Abstract
Muon, a second-order optimizer, is more data-efficient than AdamW at large batch sizes while remaining computationally cheap, and combined with muP it enables efficient hyperparameter transfer with minimal resource overhead.
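As a rough illustration of the hyperparameter-transfer idea mentioned above: under muP, settings tuned on a small proxy model are rescaled with model width so they stay near-optimal at the target size. The sketch below shows one common muP rule (the 1/width learning-rate scaling used for Adam-style hidden layers); it is an assumption for illustration, and the paper's Muon-specific parameterization may differ.

```python
def mup_transfer_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """One common muP rule: hidden-layer learning rates shrink as 1/width,
    so a rate tuned on a narrow proxy model stays near-optimal when the
    model is widened. (Illustrative; not the paper's exact recipe.)"""
    return base_lr * base_width / target_width

# Tune cheaply at width 256, then transfer to a width-4096 model
# without re-running the hyperparameter sweep.
lr_proxy = 3e-3
lr_target = mup_transfer_lr(lr_proxy, base_width=256, target_width=4096)
print(lr_target)  # 1.875e-04
```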