
mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, Benjamin Van Durme

2025-09-12


Summary

This paper introduces mmBERT, a new multilingual encoder model designed to understand and process text across more than 1800 languages.

What's the problem?

While language models that generate text have advanced rapidly, encoder-only models, which excel at understanding text for tasks like classification and retrieval, have received far less recent attention, particularly for multilingual use. Existing multilingual encoders underperform, especially on languages with little available training data.

What's the solution?

The researchers trained mmBERT on 3 trillion tokens of text spanning over 1800 languages. Their key strategy was annealed language learning: over 1700 low-resource languages were added to the data mix only during the final decay phase of training. They also used an inverse mask ratio schedule (gradually lowering the fraction of text hidden during masked-language-model training) and inverse temperature sampling (gradually flattening how languages are sampled), both of which helped the model learn more effectively. With this approach, mmBERT reaches classification performance similar to much larger generative models like OpenAI's o3 and Google's Gemini 2.5 Pro.
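To make the temperature-sampling idea concrete, here is a minimal sketch of how languages can be sampled in proportion to a temperature-scaled corpus size. The corpus sizes and the particular tau values below are illustrative assumptions, not the paper's actual numbers:

```python
def language_probs(token_counts, tau):
    """Temperature-scaled sampling: p_i proportional to count_i ** tau.

    tau = 1.0 samples each language in proportion to its data size;
    lower tau flattens the distribution toward uniform, upweighting
    low-resource languages relative to high-resource ones.
    """
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical corpus sizes (in tokens), for illustration only.
counts = {"en": 1_000_000_000, "sw": 10_000_000, "yo": 1_000_000}

# A schedule that lowers tau over training makes sampling progressively
# more uniform, giving low-resource languages a growing share of batches.
for tau in (1.0, 0.7, 0.3):
    probs = language_probs(counts, tau)
    print(tau, {lang: round(p, 3) for lang, p in probs.items()})
```

Lowering tau late in training is one way to shift weight toward rare languages without letting them dominate the early, noisier phases of optimization.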

Why it matters?

This work is important because it shows that a powerful multilingual model can be built without enormous amounts of data for every single language. By strategically introducing low-resource languages late in training, mmBERT improves performance across the board, making it a valuable tool for multilingual tasks and helping to bridge the gap in language technology for less-represented languages.

Abstract

Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.
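The inverse mask ratio schedule mentioned in the abstract can be sketched as a masking rate that decays as training progresses. The linear shape and the endpoint values (30% down to 15%) below are assumptions for illustration, not the paper's reported settings:

```python
def mask_ratio(step, total_steps, start=0.30, end=0.15):
    """Decay the masked-language-model mask ratio over training.

    Early in training a high mask ratio gives a harder, denser
    learning signal; decaying it later makes the task closer to
    the lighter masking typically used at convergence. The values
    here are illustrative, not taken from the mmBERT paper.
    """
    frac = min(step / total_steps, 1.0)  # progress in [0, 1]
    return start + (end - start) * frac

# The ratio fed to the data collator at a few points in training.
for step in (0, 50_000, 100_000):
    print(step, round(mask_ratio(step, total_steps=100_000), 3))
```

In practice this ratio would be passed to whatever masking routine builds MLM training batches at each step.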