FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro von Werra, Thomas Wolf
2025-06-26
Summary
This paper introduces FineWeb2, a pipeline that automatically prepares pre-training data for large language models across many languages, making it easier to train these models on diverse, well-balanced multilingual data.
What's the problem?
Training language models for multiple languages is hard because existing data-processing methods are usually designed around a single language and do not adapt well to others, leading to lower-quality models and unbalanced datasets.
What's the solution?
The researchers designed FineWeb2 as a flexible pipeline that automatically adjusts how it collects, cleans, and organizes web data for each language, preserving data quality while handling each language's specific challenges, such as differing writing systems and word-boundary conventions; the sketch below illustrates the idea.
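To make "adjusting per language" concrete, here is a minimal Python sketch of one such step. This is not the authors' implementation: the names (`LanguageConfig`, `keep_document`), the stopword lists, and every threshold are illustrative placeholders. The point is only that segmentation rules and filter thresholds are looked up per language instead of being hard-coded for English.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LanguageConfig:
    """Illustrative per-language filter settings (all values are placeholders)."""
    split_words: Callable[[str], list[str]]  # language-aware word segmentation
    stopwords: frozenset[str]                # high-frequency "glue" words
    min_words: int = 20                      # drop very short documents
    min_stopword_ratio: float = 0.10         # fluency signal: natural prose
                                             # contains many stopwords

# Whitespace segmentation works for English and German; a script written
# without spaces (e.g. Thai or Japanese) would need a real segmenter here.
CONFIGS = {
    "eng": LanguageConfig(
        split_words=str.split,
        stopwords=frozenset({"the", "and", "of", "to", "in", "is", "a"}),
    ),
    "deu": LanguageConfig(
        split_words=str.split,
        stopwords=frozenset({"der", "die", "das", "und", "ist", "in", "zu"}),
        min_stopword_ratio=0.08,  # thresholds tuned per language, not shared
    ),
}

def keep_document(text: str, lang: str) -> bool:
    """Filter a document using its own language's settings, not English defaults."""
    cfg = CONFIGS[lang]
    words = cfg.split_words(text.lower())
    if len(words) < cfg.min_words:
        return False
    stopword_ratio = sum(w in cfg.stopwords for w in words) / len(words)
    return stopword_ratio >= cfg.min_stopword_ratio
```

In the paper's approach, analogous settings are adapted automatically for each language rather than written by hand, which is what lets a single pipeline scale to many languages.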
Why it matters?
This matters because it helps create better and fairer multilingual AI systems that can understand and generate text in many languages, which is important both for global communication and for making AI accessible to more people around the world.
Abstract
A new pre-training dataset curation pipeline based on FineWeb supports multilingual LLMs by adapting automatically to any language, improving model performance and keeping dataset quality balanced across languages.