Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang
2025-03-06
Summary
This paper introduces Babel, a new AI language model that can understand and generate text in many languages, including ones that are often ignored by other AI systems.
What's the problem?
Current AI language models mostly focus on well-resourced languages like English and French, leaving out many widely spoken languages from less developed regions. As a result, a lot of people can't benefit from these AI tools.
What's the solution?
The researchers created Babel, an AI model that works with the 25 most spoken languages in the world, covering over 90% of global speakers. Instead of simply continuing to train an existing model, they added extra layers to it (a "layer extension" technique) to raise its performance ceiling, and they released two versions: a smaller one (Babel-9B) for everyday use and fine-tuning, and a larger one (Babel-83B) that performs as well as some commercial AI models.
Why it matters?
This matters because it makes advanced AI language tools available to billions more people who speak languages that are usually left out. It could help reduce inequality in access to AI technology and make it easier for people from different language backgrounds to use AI in their daily lives and work.
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce Babel, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: Babel-9B, designed for efficient inference and fine-tuning, and Babel-83B, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level as commercial models.
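The layer-extension idea mentioned in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical example assuming a duplicate-and-insert scheme similar to depth up-scaling; the paper's exact insertion points and initialization are not described here, so the function `extend_layers` and the `insert_every` parameter are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch.nn as nn

def extend_layers(layers: nn.ModuleList, insert_every: int = 4) -> nn.ModuleList:
    """Sketch of a layer-extension (depth up-scaling) scheme.

    Every `insert_every`-th transformer block is duplicated, growing the
    parameter count so the new blocks can absorb extra capacity during
    continued multilingual pretraining.

    NOTE: illustrative only -- the duplication interval and copy-based
    initialization are assumptions, not the paper's stated method.
    """
    extended = []
    for i, layer in enumerate(layers):
        extended.append(layer)
        if (i + 1) % insert_every == 0:
            # Initialize the inserted block as an exact copy of its
            # neighbor, so the extended model's outputs start close to
            # the original model's before further training.
            extended.append(copy.deepcopy(layer))
    return nn.ModuleList(extended)
```

Under this sketch, a 32-block decoder extended with `insert_every=4` would grow to 40 blocks; the copied blocks then drift from their neighbors as continued pretraining proceeds.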