
Multilingual Routing in Mixture-of-Experts

Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng

2025-10-07


Summary

This paper investigates how Mixture-of-Experts (MoE) models, an architecture used to build very large language models, handle different languages. It looks at how the model's router decides which of its 'experts' to use when processing text in various languages.

What's the problem?

MoE models are powerful, but it's not well understood *how* they work with multiple languages. Specifically, the researchers wanted to know whether these models treat each language separately or whether there is some shared processing across languages. The problem is that if a model doesn't share knowledge between languages, it might not perform well on languages it hasn't seen much of during training.

What's the solution?

The researchers analyzed how MoE models route the tokens of parallel sentences to different 'experts', doing this across many languages at once. They found that the early and late layers route each language separately, while the middle layers show a surprising amount of overlap: the model sends tokens from different languages to similar experts there. They then nudged the router to use the experts it typically uses for English in those middle layers (a minimal sketch of this kind of intervention follows below), and this actually improved performance in other languages.
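To make the intervention concrete, here is a minimal PyTorch sketch of the general idea, not the authors' code: add a small bias to the router's logits for experts that English tokens activate most often in a middle layer, so tokens from other languages are nudged toward those same experts. The function names, tensor shapes, expert indices, and bias value are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): steer an MoE
# router at inference time by boosting the logits of experts that English
# frequently activates in a middle layer.
import torch

def steer_router_logits(router_logits, english_expert_ids, bias=1.0):
    """Add a fixed bias to the logits of experts frequently used for English.

    router_logits: (num_tokens, num_experts) raw gating scores for one layer.
    english_expert_ids: indices of experts most activated by English tokens
        in this (middle) layer, estimated offline from parallel data.
    bias: steering strength; small values nudge rather than override routing.
    """
    steered = router_logits.clone()
    steered[:, english_expert_ids] += bias
    return steered

# Toy usage: 4 tokens routed over 8 experts with top-2 selection.
torch.manual_seed(0)
logits = torch.randn(4, 8)
english_experts = torch.tensor([1, 5])  # hypothetical "English" experts
steered = steer_router_logits(logits, english_experts, bias=1.0)

print("top-2 experts before steering:\n", logits.topk(2, dim=-1).indices)
print("top-2 experts after steering:\n", steered.topk(2, dim=-1).indices)
```

In a real MoE layer, the steered logits would replace the originals before the top-k expert selection and softmax weighting, and only in the targeted middle layers.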

Why it matters?

This research is important because it helps us understand how to build better multilingual language models. By understanding how these models process different languages, we can design them to share knowledge more effectively, leading to improved performance across a wider range of languages. The simple method they developed to improve performance is also significant, as it shows that even small changes can have a noticeable impact on multilingual capabilities.

Abstract

Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
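As an illustration of what cross-lingual routing alignment could mean in practice, the sketch below compares per-layer expert-usage frequencies for parallel text in two languages using cosine similarity. This is a plausible stand-in metric rather than necessarily the measure used in the paper; the variable names and toy data are assumptions.

```python
# Illustrative sketch (not the paper's metric): quantify how similarly two
# languages are routed in one layer by comparing how often each expert is
# selected for parallel text, via cosine similarity of frequency vectors.
from collections import Counter

def expert_usage(selected_experts, num_experts):
    """Frequency vector over experts from a flat list of selected indices
    (top-k choices for all tokens of one language in one layer)."""
    counts = Counter(selected_experts)
    total = sum(counts.values()) or 1
    return [counts.get(e, 0) / total for e in range(num_experts)]

def routing_alignment(experts_lang, experts_en, num_experts):
    """Cosine similarity between a language's and English's expert-usage
    distributions for one layer (1.0 = identical routing profile)."""
    p = expert_usage(experts_lang, num_experts)
    q = expert_usage(experts_en, num_experts)
    dot = sum(a * b for a, b in zip(p, q))
    norm = (sum(a * a for a in p) ** 0.5) * (sum(b * b for b in q) ** 0.5)
    return dot / norm if norm else 0.0

# Toy example: hypothetical expert choices for parallel sentences in one layer.
german_layer12 = [1, 5, 5, 2, 1, 7]
english_layer12 = [1, 5, 1, 2, 5, 7]
print(routing_alignment(german_layer12, english_layer12, num_experts=8))
```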