EuroLLM: Multilingual Language Models for Europe

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins

2024-09-25

Summary

This paper introduces the EuroLLM project, which aims to build open-weight multilingual large language models (LLMs) that can understand and generate text in all official European Union languages, as well as several additional relevant languages. The broader goal is to make AI more accessible across Europe.

What's the problem?

Most existing large language models focus on English and a handful of other widely spoken languages, leaving many European languages underrepresented. This limits how much speakers of those languages can benefit from AI technologies and can deepen inequalities in access to information and communication.

What's the solution?

To address this issue, the researchers are developing EuroLLM, a suite of open-weight multilingual models that cover all official European Union languages and several additional ones. They collected and filtered a dataset of 4 trillion tokens from various sources, developed scaling laws, built a multilingual tokenizer, and chose the data mix and modeling configurations used for training. They released two initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation. The project emphasizes collaboration among European universities and research institutions to ensure quality and inclusivity.

Why does it matter?

This research is significant because it promotes linguistic diversity and accessibility in AI technologies across Europe. By developing models that can understand and generate text in multiple languages, EuroLLM aims to empower users from different linguistic backgrounds, fostering better communication and information sharing. This initiative supports the broader goal of creating a competitive and innovative European AI ecosystem.

Abstract

The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.