The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf
2024-06-26

Summary
This paper introduces FineWeb, a 15-trillion-token text dataset designed to improve the performance of large language models (LLMs) by providing high-quality pretraining data. It also presents FineWeb-Edu, a specialized subset focused on educational content.
What's the problem?
The effectiveness of LLMs depends heavily on the quality and size of the datasets used to train them. However, the pretraining datasets behind state-of-the-art open models like Llama 3 and Mixtral are not released, and little is known about how they were built, making it difficult for researchers to learn how to create effective datasets themselves.
What's the solution?
The authors created FineWeb by collecting data from 96 snapshots of Common Crawl, a public web archive. They carefully filtered and deduplicated the data so that only high-quality text was kept, documenting and ablating each design choice along the way. The resulting dataset outperforms other open pretraining datasets. They also built FineWeb-Edu, a 1.3-trillion-token subset filtered for educational content; models trained on it showed significant improvements on knowledge- and reasoning-intensive tasks.
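As a rough illustration of the filter-then-deduplicate flow described above, here is a minimal Python sketch built on the third-party datasketch library for MinHash. It is not the authors' released pipeline (datatrove), and the heuristics, shingle size, and thresholds are placeholder assumptions rather than the paper's actual settings.

```python
# Toy filter-and-deduplicate pass over a list of documents. This is NOT the
# authors' released pipeline (datatrove); it only illustrates the general
# technique: cheap heuristic filters first, then MinHash-based near-duplicate
# removal. Requires the third-party `datasketch` package.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128           # number of MinHash permutations (illustrative)
JACCARD_THRESHOLD = 0.8  # similarity above which two documents count as duplicates


def passes_quality_filters(text: str) -> bool:
    """Placeholder heuristics standing in for real quality filters."""
    words = text.split()
    if len(words) < 50:  # drop very short documents
        return False
    symbols = sum(not c.isalnum() and not c.isspace() for c in text)
    return symbols / max(len(text), 1) < 0.3  # drop markup/boilerplate-heavy pages


def minhash_signature(text: str) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles."""
    words = text.lower().split()
    sig = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - 4, 1)):
        sig.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return sig


def filter_and_deduplicate(docs: list[str]) -> list[str]:
    """Keep documents that pass the filters and are not near-duplicates of kept ones."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for idx, text in enumerate(docs):
        if not passes_quality_filters(text):
            continue
        sig = minhash_signature(text)
        if lsh.query(sig):  # a similar document was already kept
            continue
        lsh.insert(f"doc-{idx}", sig)
        kept.append(text)
    return kept
```

The paper's ablations compare deduplication and filtering strategies in depth; this sketch only shows the general shape of MinHash-based near-duplicate removal.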
Why it matters?
This research is important because it provides a transparent approach to creating high-quality datasets for training LLMs. By sharing their methods and releasing the FineWeb datasets, the authors contribute valuable resources that can help improve future AI models, making them more effective in understanding and generating human-like text.
Abstract
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
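Since the datasets are publicly released, a quick way to inspect them is to stream a few documents with the Hugging Face datasets library. The sketch below is an assumption-laden example: the repository ids ("HuggingFaceFW/fineweb", "HuggingFaceFW/fineweb-edu"), the "sample-10BT" config, and the record fields are taken from the public release and are not guaranteed by this abstract.

```python
# Stream a handful of documents from the released datasets without downloading
# them in full. Repository ids and the "sample-10BT" config are assumptions
# based on the public Hugging Face release; field names may differ.
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                           split="train", streaming=True)

for doc in fineweb.take(3):
    # each record carries the text plus provenance metadata such as the source URL
    print(doc["url"], doc["text"][:200].replace("\n", " "))
```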