RedPajama: an Open Dataset for Training Large Language Models
Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, Ce Zhang
2024-11-20

Summary
This paper introduces RedPajama, a pair of open datasets for training large language models, released to improve transparency and accessibility in language model development.
What's the problem?
As large language models (LLMs) become a cornerstone of AI, most top-performing models disclose little about how they were trained or what data they were trained on. Without access to the underlying datasets, researchers cannot fully understand, audit, or replicate these models, and this opacity slows the development of open-source language models, which are crucial for advancing AI research.
What's the solution?
To tackle these issues, the authors release two datasets. RedPajama-V1 is an open reproduction of the LLaMA training dataset, while RedPajama-V2 is a massive web-only collection of raw, unfiltered text comprising over 30 trillion tokens. RedPajama-V2 additionally ships quality signals and metadata alongside the documents, which researchers can use to filter and analyze the data themselves. Together, the datasets are meant to serve as a comprehensive, transparent foundation for building high-quality open-source language models.
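To give a rough sense of what "quality signals and metadata" look like in practice, the sketch below loads a small sample of RedPajama-V2 from the Hugging Face Hub and prints the signals attached to one document. The repository name is real; the configuration name, column names, and JSON encoding of the signals are assumptions for illustration and may differ from the actual schema.

# Hedged sketch: inspect the quality signals attached to RedPajama-V2 documents.
# Assumptions (not confirmed by this summary): config name "sample", columns
# "raw_content" and "quality_signals", signals stored as a JSON-encoded string.
import json
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",        # assumed small sample configuration
    split="train",
    streaming=True,       # stream so the multi-terabyte corpus is never fully downloaded
)

doc = next(iter(ds))                              # first document in the stream
signals = json.loads(doc["quality_signals"])      # assumed JSON-encoded column
print(doc["raw_content"][:200])                   # first 200 characters of the text
for name, value in list(signals.items())[:10]:    # show a handful of signal names
    print(name, value)

In a real pipeline, these per-document signals would feed a filtering rule or a learned classifier rather than being printed.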
Why it matters?
This research is important because it promotes transparency in AI development and gives researchers access to large quantities of high-quality data for training language models. By pairing the data with detailed quality signals, RedPajama can help improve the performance and reliability of open language models and makes building them more accessible to the broader community.
Abstract
Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and, with their quality signals, facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen, and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models of up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
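To make "leveraging quality signals to curate high-quality subsets" concrete, here is a minimal, hypothetical rule-based filter in the spirit of the heuristics described above. The signal names and thresholds are illustrative assumptions, not the paper's exact signals or published filtering recipe.

# Hedged sketch of a rule-based quality filter over per-document signals.
# The signal names and thresholds below are illustrative assumptions; the
# actual RedPajama-V2 signals and any published filtering rules may differ.
from typing import Dict

def passes_quality_rules(signals: Dict[str, float]) -> bool:
    """Return True if a document's quality signals clear simple thresholds."""
    rules = [
        signals.get("word_count", 0) >= 50,                     # long enough to be useful
        signals.get("word_count", 0) <= 100_000,                # not a pathological dump
        signals.get("mean_word_length", 0) >= 3,                # filters symbol-heavy pages
        signals.get("mean_word_length", 0) <= 10,
        signals.get("fraction_duplicate_lines", 1.0) <= 0.3,    # boilerplate / template check
        signals.get("language_model_perplexity", 1e9) <= 300,   # fluency proxy
    ]
    return all(rules)

# Example: keep only documents whose signals satisfy every rule.
docs = [
    {"word_count": 812, "mean_word_length": 4.7,
     "fraction_duplicate_lines": 0.05, "language_model_perplexity": 180.0},
    {"word_count": 12, "mean_word_length": 2.1,
     "fraction_duplicate_lines": 0.6, "language_model_perplexity": 900.0},
]
curated = [d for d in docs if passes_quality_rules(d)]
print(len(curated))  # -> 1, only the first document survives the filter

Because the signals are shipped with the data rather than baked into a fixed filter, different teams can swap in their own thresholds or learned classifiers and reproduce each other's subsets exactly.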