Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

Zhongyang Li, Ziyue Li, Tianyi Zhou

2025-11-11

Summary

This paper focuses on improving how large language models that use a 'mixture of experts' system decide which parts of the model to use for different tasks. These models are powerful, but the way they route information to the experts isn't always optimal.

What's the problem?

Large language models are getting bigger and better by using a technique called 'mixture of experts,' where different parts of the model specialize in different things. However, the system that decides *which* expert to use for a given input, called the 'router,' doesn't consistently make the best choices. This leaves the model performing well below its potential, with a gap of roughly 10-20% in accuracy compared to optimal routing.

What's the solution?

The researchers developed a method called 'Routing Manifold Alignment' (RoMA) to fix this. RoMA lightly finetunes only the router while keeping the rest of the model frozen. During this finetuning, it encourages the router to make similar expert choices for inputs that represent similar tasks. Concretely, it looks at examples where the router's choices led to a correct answer, and for each input it nudges the routing weights toward those of its 'successful' neighbors in a task embedding space. This aligns the way the router routes inputs with how an embedding model understands the underlying tasks, making the routing more consistent.
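To make the idea concrete, here is a minimal sketch of what a RoMA-style regularization term could look like. This is an illustrative reconstruction, not the authors' code: the function name, the use of cosine similarity, the choice of k neighbors, and the squared-distance penalty are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def roma_regularizer(routing_weights, task_embeddings, success_mask, k=3):
    """Pull each sample's routing weights toward those of its k nearest
    *successful* neighbors in a task embedding space.

    routing_weights: (N, E) router outputs per sample (E experts).
    task_embeddings: (N, D) embeddings from a separate embedding model.
    success_mask:    (N,) bool, True where the sample's routing led to
                     a correct answer.
    """
    # Cosine similarity between all samples in the task embedding space.
    emb = F.normalize(task_embeddings, dim=-1)
    sim = emb @ emb.T                                    # (N, N)
    # Only successful samples may serve as neighbors; exclude self-matches.
    sim = sim.masked_fill(~success_mask.unsqueeze(0), float("-inf"))
    sim.fill_diagonal_(float("-inf"))
    # Indices of the k most similar successful neighbors per sample.
    idx = sim.topk(k, dim=-1).indices                    # (N, k)
    neighbor_w = routing_weights[idx]                    # (N, k, E)
    # Penalize squared distance to each neighbor's routing weights;
    # neighbors act as fixed targets, so detach them from the graph.
    diff = routing_weights.unsqueeze(1) - neighbor_w.detach()
    return diff.pow(2).sum(-1).mean()
```

In a full setup this term would be added, with some weighting coefficient, to the usual post-training loss, with gradients flowing only into the router parameters.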

Why does it matter?

This research is important because it shows a way to significantly improve the performance of already powerful large language models without requiring a massive amount of retraining. By making the routing system more efficient and aligned with task understanding, RoMA helps these models generalize better and achieve higher accuracy, unlocking more of their potential.

Abstract

Sparse Mixture-of-Experts (MoE) architectures have been widely adopted in recent large language models since they can efficiently scale up model capacity without increasing inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce the gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of the routers (with all other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts across samples is essential for better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.