EuroBERT: Scaling Multilingual Encoders for European Languages
Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo
2025-03-10
Summary
This paper introduces EuroBERT, a new family of AI models designed to understand and process multiple languages, especially those spoken in Europe, along with other widely used global languages.
What's the problem?
Older AI models that could understand multiple languages were being left behind by newer, more advanced models that focus on generating text. However, many of the improvements behind these newer models could also benefit the older style of models.
What's the solution?
The researchers created EuroBERT by taking the best parts of the newer text-generating models and applying them to the older style of language-understanding models. They trained EuroBERT on a huge amount of text in many different languages and made it able to handle much longer pieces of text (up to 8,192 tokens) than previous models. They also made sure it could do well on a wide range of tasks, including understanding math and computer code.
Why it matters?
This matters because EuroBERT can help computers understand and work with many languages more effectively, which is crucial in our connected world. It can be used for things like searching for information across languages, analyzing sentiment in different cultures, and even helping with math and coding tasks. By making the models public, the researchers are allowing others to use and improve upon their work, which could lead to even better language understanding tools in the future.
Abstract
General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, while natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.