SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
2025-12-18
Summary
This paper introduces SonicMoE, a set of algorithms and GPU kernels that make Mixture of Experts (MoE) language models train faster and use less activation memory.
What's the problem?
MoE models are great for scaling up language models without a matching increase in compute cost, but two recent trends cause trouble. Making each expert smaller (higher granularity) means more activations have to be stored and more time is spent moving data rather than computing. Adding more experts while activating the same number per token (higher sparsity) means each expert often gets an awkward number of tokens, so the GPU pads the work and wastes calculations. Basically, these models are hitting limits in how quickly and how cheaply they can be trained.
What's the solution?
The researchers developed SonicMoE, which tackles these problems in three ways. First, they created a memory-efficient way to compute the forward and backward passes that caches as little activation data as possible for the backward pass. Second, they designed GPU kernels that overlap memory IO with computation, so the GPU moves data and does calculations at the same time. Finally, they came up with a 'token rounding' technique that minimizes the compute wasted on padding in Grouped GEMM kernels when there are many experts.
Why does it matter?
SonicMoE is important because it speeds up the training of these models and reduces the memory they need. For a fine-grained 7B MoE, it cuts activation memory by 45% and runs the MoE kernels 1.86x faster than ScatterMoE on Hopper GPUs. In end-to-end training, SonicMoE on 64 H100s reaches 213 billion tokens per day, close to the 225 billion tokens per day ScatterMoE gets on 96 H100s. This means researchers and companies can train MoE models at a similar speed with fewer GPUs, and the kernels are open source.
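To make the "fewer GPUs" comparison concrete, the throughput figures reported in the abstract can be normalized per GPU. The snippet below is only a back-of-the-envelope check; the helper name `per_gpu_throughput` is made up for illustration and the paper does not report this ratio directly.

```python
def per_gpu_throughput(billion_tokens_per_day: float, num_gpus: int) -> float:
    """Billions of tokens per day processed by each GPU (hypothetical helper)."""
    return billion_tokens_per_day / num_gpus


# Figures taken from the abstract (7B MoE, FSDP-2, H100 GPUs).
sonic = per_gpu_throughput(213, 64)    # SonicMoE on 64 H100s
scatter = per_gpu_throughput(225, 96)  # ScatterMoE on 96 H100s
ratio = sonic / scatter

print(f"SonicMoE:   {sonic:.2f}B tokens/day/GPU")
print(f"ScatterMoE: {scatter:.2f}B tokens/day/GPU")
print(f"Per-GPU ratio: {ratio:.2f}x")  # about 1.42x
```

This end-to-end ratio is smaller than the 1.86x kernel-level figure, which is expected because a full training step also includes work outside the MoE kernels.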
Abstract
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day comparable to ScatterMoE's 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-K routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
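The padding problem mentioned in the abstract can be illustrated with a small sketch. It assumes a Grouped GEMM that processes each expert's tokens in row tiles of a fixed size, padding the last tile of each expert. The tile size of 128, the function names, and the nearest-multiple rounding rule are all assumptions made for illustration; the abstract does not specify how token rounding chooses its targets, so this is not the authors' algorithm.

```python
import math


def padded_rows(tokens_per_expert, tile_m):
    """Rows computed per expert when each count is padded up to a multiple of tile_m."""
    return [math.ceil(t / tile_m) * tile_m for t in tokens_per_expert]


def wasted_fraction(tokens_per_expert, tile_m):
    """Fraction of computed rows that are padding rather than real tokens."""
    padded = sum(padded_rows(tokens_per_expert, tile_m))
    useful = sum(tokens_per_expert)
    return (padded - useful) / padded if padded else 0.0


def round_to_tile(tokens_per_expert, tile_m):
    """Hypothetical rounding: snap each count to the nearest multiple of tile_m."""
    return [(t + tile_m // 2) // tile_m * tile_m for t in tokens_per_expert]


TILE_M = 128                # assumed tile size, not taken from the paper
counts = [130, 70, 5, 250]  # made-up tokens routed to four experts

print(padded_rows(counts, TILE_M))         # rows actually computed per expert
print(wasted_fraction(counts, TILE_M))     # share of rows that are padding
rounded = round_to_tile(counts, TILE_M)
print(rounded)                             # counts after the illustrative rounding
print(wasted_fraction(rounded, TILE_M))    # no padding once counts align with tiles
```

With many experts and few tokens per expert, more counts fall just past a tile boundary, so the padded share grows. Rounding changes which tokens each expert processes, which is why the abstract notes that downstream performance stays similar.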