Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

2025-08-27

Summary

This research investigates how the design of modern large language models, specifically those using a 'Mixture-of-Experts' approach, impacts their ability to both memorize information and perform complex reasoning tasks.

What's the problem?

As language models get bigger, simply increasing their size doesn't always lead to better performance, especially when it comes to reasoning. Current rules of thumb for scaling up models were developed for older, dense designs and don't fully account for the unique characteristics of Mixture-of-Experts models, which are 'sparse': only a fraction of the model's parameters is used for any given input. The researchers wanted to understand how this sparsity affects a model's ability to memorize and to reason.
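To make the sparsity idea concrete, here is a minimal sketch of how total versus active feed-forward parameters are counted in one MoE layer. The layer sizes and function name are illustrative assumptions, not the paper's actual configurations:

```python
def moe_param_counts(d_model, d_ff, n_experts, top_k):
    """Count total vs. active FFN parameters in one MoE layer.

    Each expert is modeled as a two-matrix feed-forward block
    (2 * d_model * d_ff weights). Only top_k experts run per token,
    so active parameters stay small even as n_experts grows.
    """
    per_expert = 2 * d_model * d_ff
    total = n_experts * per_expert   # grows with the number of experts
    active = top_k * per_expert      # fixed by the routing budget
    return total, active

# Hypothetical example: 64 experts with top-2 routing means only
# 1/32 of the layer's FFN weights fire for any given token.
total, active = moe_param_counts(d_model=1024, d_ff=4096, n_experts=64, top_k=2)
```

Scaling up `n_experts` inflates total parameters (and memorization capacity) without changing the per-token compute, which is exactly the dimension the paper varies.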

What's the solution?

The researchers trained families of language models that systematically varied the total number of parameters, the number of parameters active per token, and how those active parameters were selected (using a 'top-k' routing method), while holding the overall compute budget fixed. By carefully tracking each model's training loss, its downstream task loss and accuracy, and the differences between them, they could isolate the effects of sparsity on memorization versus reasoning. They also tried post-training reinforcement learning (GRPO) and extra test-time compute to see if these could repair reasoning in overly sparse models, but found they didn't help much.
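The 'top-k' routing mentioned above can be sketched in a few lines of NumPy. This is a simplified illustration with made-up names and random weights, not the paper's trained Transformer routers: each token is sent to the k experts with the highest router scores, and their outputs are mixed with gate weights renormalized over just those k experts:

```python
import numpy as np

def top_k_moe_layer(x, gate_w, expert_ws, k=2):
    """Minimal top-k MoE routing sketch (illustrative only).

    x:         (tokens, d_model) token activations
    gate_w:    (d_model, n_experts) router weight matrix
    expert_ws: list of (d_model, d_model) expert weight matrices
    """
    logits = x @ gate_w                          # (tokens, n_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = np.exp(logits[t, topk[t]])
        scores /= scores.sum()                   # softmax over the selected k only
        for s, e in zip(scores, topk[t]):
            out[t] += s * (x[t] @ expert_ws[e])  # only k experts run per token
    return out
```

Note that adding more entries to `expert_ws` grows total parameters, while `k` alone fixes how many are active per token; the paper finds it is this total/active balance, not the routing choice per se, that drives the memorization-reasoning trade-off.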

Why it matters?

This work is important because it shows that simply making language models bigger isn't enough to improve their reasoning abilities, especially under the Mixture-of-Experts design. It highlights the need to rethink how we scale these models and suggests that balancing total and active parameters, rather than just adding more experts, is crucial for building models that can truly reason, not just memorize. The findings provide guidance for future model development and help explain why some large models struggle with complex tasks despite their size.

Abstract

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-k routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-k alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
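The abstract's two gaps can be illustrated with a toy example (all numbers invented for illustration). The train-test generalization gap compares training and test loss directly; the loss-accuracy gap shows up when test loss keeps improving but task accuracy stops tracking it, as the paper reports for reasoning benchmarks on overly sparse models:

```python
# Hypothetical checkpoint measurements for an overly sparse model.
train_loss    = [2.4, 2.1, 1.9, 1.7]
test_loss     = [2.6, 2.4, 2.3, 2.2]
reasoning_acc = [0.30, 0.41, 0.43, 0.42]  # saturates, then slightly regresses

# Train-test generalization gap: widens as the model memorizes.
gen_gap = [te - tr for tr, te in zip(train_loss, test_loss)]

# Loss-accuracy gap: loss still falls monotonically, accuracy does not.
loss_improves = all(b < a for a, b in zip(test_loss, test_loss[1:]))
acc_improves  = all(b > a for a, b in zip(reasoning_acc, reasoning_acc[1:]))
```

Recording both loss and accuracy per checkpoint, as the paper does, is what makes these two failure modes separable.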