ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi
2025-10-29
Summary
This research investigates how the size of AI models and the amount of data they're trained on affect their performance, and importantly, it examines this for many languages, not just English.
What's the problem?
Most studies on improving AI have focused on English, even though these AI systems are used by people all over the world who speak different languages. This means we don't really understand how to best build AI that works well for *everyone*, and scaling up models for many languages can be inefficient or lead to worse results.
What's the solution?
The researchers conducted a huge number of experiments – 774 in total – training AI models of different sizes (from small to quite large) on over 400 languages and testing them on 48 languages. They developed a new predictive formula, called ATLAS, that estimates how well a model will perform based on its size and the data it's trained on, and this formula makes more accurate predictions than previous ones. They also measured which languages help each other during learning, how to best add new languages to a model, and when it's better to train a model from scratch versus building on an existing one.
Why does it matter?
This work is important because it provides a scientific basis for building AI that works well in many languages, not just English. It helps developers create more effective and efficient AI systems for a global audience, making AI more accessible and useful to a wider range of people.
Abstract
Scaling laws research has focused overwhelmingly on English -- yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws in out-of-sample generalization, often by more than 0.3 R^2. Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 x 38 = 1444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models -- beyond English-first AI.
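To make the abstract's claim about out-of-sample R^2 concrete, the sketch below shows how a parametric scaling law is typically fit to (model size, data size, loss) measurements and scored. The abstract does not give the ATLAS functional form, so this uses the standard Chinchilla-style baseline L(N, D) = E + A/N^alpha + B/D^beta on synthetic data; the form, the data, and all coefficient values are illustrative assumptions, not the ATLAS law itself.

```python
import numpy as np
from scipy.optimize import curve_fit

def parametric_loss(X, E, A, B, alpha, beta):
    # Chinchilla-style law: L(N, D) = E + A / N^alpha + B / D^beta
    # N = parameter count, D = training tokens.
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) grid spanning 10M-8B params -- illustrative only.
rng = np.random.default_rng(0)
N = np.tile([1e7, 5e7, 1e8, 5e8, 1e9, 8e9], 4)
D = np.repeat([1e9, 5e9, 2e10, 1e11], 6)
true = parametric_loss((N, D), 1.7, 400.0, 410.0, 0.34, 0.28)
loss = true + rng.normal(0.0, 0.01, size=true.shape)

# Fit the five coefficients by nonlinear least squares.
popt, _ = curve_fit(
    parametric_loss, (N, D), loss,
    p0=[2.0, 100.0, 100.0, 0.3, 0.3],
    bounds=(0.0, np.inf), maxfev=20000,
)
E, A, B, alpha, beta = popt

# Score the fit with R^2, the metric the abstract compares laws on.
pred = parametric_loss((N, D), *popt)
r2 = 1.0 - np.sum((loss - pred) ** 2) / np.sum((loss - loss.mean()) ** 2)
print(f"fitted alpha={alpha:.2f}, beta={beta:.2f}, R^2={r2:.3f}")
```

In practice the comparison in the paper is done out-of-sample (fit on some runs, score on held-out ones), which is where the reported gap of more than 0.3 R^2 over prior laws arises.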