
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

Xintong Hao, Ke Shen, Chenggang Li

2025-02-07

Summary

This paper introduces MAGA (MAssive Genre-Audience reformulation), a new method for creating more high-quality data for training large AI language models. It's designed to solve the problem of not having enough good data to make AI models even smarter.

What's the problem?

As AI language models get bigger and more powerful, they need huge amounts of high-quality text data to learn from. However, there's not enough of this good data available, which is slowing down the progress of making these AI models even better. It's like trying to teach a super-smart student, but running out of advanced books for them to read.

What's the solution?

The researchers created MAGA, a clever way to take existing good-quality text and rewrite it in many different styles for various audiences. This method produced a massive new dataset called MAGACorpus with 770 billion tokens. They tested this new data on AI models of different sizes (from 134 million to 13 billion parameters) and found it consistently helped the models perform better. They also studied how to avoid problems, such as model collapse, that can happen when training on artificially created data.
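The core idea of rewriting one source text into many (genre, audience) variants can be sketched as a simple prompt-generation loop. The paper does not publish its exact prompts or taxonomies, so the genre list, audience list, and prompt wording below are illustrative assumptions, not the authors' actual implementation:

```python
from itertools import product

# Hypothetical genre and audience pools; the paper's real taxonomies
# are not specified here, so these lists are illustrative only.
GENRES = ["encyclopedia entry", "dialogue", "tutorial", "news report"]
AUDIENCES = ["children", "domain experts", "general readers"]

def build_reformulation_prompts(document: str, pairs_per_doc: int = 5) -> list[str]:
    """Build one rewriting prompt per (genre, audience) pair, so a single
    source document can be expanded into many reformulated variants.

    In a full pipeline, each prompt would be sent to a language model and
    the outputs collected into the expanded pretraining corpus.
    """
    prompts = []
    for genre, audience in list(product(GENRES, AUDIENCES))[:pairs_per_doc]:
        prompts.append(
            f"Rewrite the following text as a {genre} aimed at {audience}, "
            f"preserving its factual content:\n\n{document}"
        )
    return prompts

prompts = build_reformulation_prompts("Water boils at 100 degrees Celsius at sea level.")
print(len(prompts))  # one prompt per selected (genre, audience) pair
```

Because each source document fans out into several stylistically distinct rewrites, the corpus grows multiplicatively while the underlying facts are reused, which is how a fixed pool of text can be expanded toward hundreds of billions of tokens.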

Why it matters?

This research matters because it could help overcome a major roadblock in AI development. By finding a way to create more high-quality training data, MAGA could allow AI language models to keep improving and learning new skills. This could lead to smarter AI assistants, better language translation, and more advanced AI tools for various fields like education, science, and technology.

Abstract

Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale up. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from existing corpora. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build the 770B-token MAGACorpus. (2) We evaluate MAGACorpus under different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity of large-scale synthetic data for next-generation pretraining. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal limitations in conventional collapse detection metrics based on validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.