RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

Zichun Yu, Chenyan Xiong

2025-10-14

Summary

This paper introduces a new method called RePro for improving how large language models (LLMs) are trained. It focuses on getting more value out of existing training data, rather than constantly needing to find new data, which is becoming increasingly difficult.

What's the problem?

Large language models need massive amounts of high-quality data to learn effectively. Think of this data as fuel for the model. However, the easily accessible, high-quality data on the internet is starting to run out, making it harder and more expensive to train even better models. Simply using more data doesn't always help if the data isn't good enough.

What's the solution?

RePro tackles this problem by 'recycling' existing data. Instead of using the data as-is, the authors train a smaller language model with reinforcement learning to *rephrase* the original data into a higher-quality form that is more useful for training larger models. The rephrasing is guided by rewards that encourage both quality and faithfulness to the original meaning. In effect, they built a system that takes existing data and makes it act like new, better data.
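To make the reward idea concrete, here is a minimal sketch of how a quality score and several faithfulness scores might be blended into a single scalar reward for the RL rephraser. The function name, score ranges, and equal weighting are illustrative assumptions, not the paper's exact formulation:

```python
def composite_reward(quality: float, faithfulness: list[float],
                     weight: float = 0.5) -> float:
    """Blend one quality score with the mean of several faithfulness scores.

    Hypothetical sketch: all scores are assumed to lie in [0, 1], and
    `weight` balances quality against faithfulness (0.5 = equal weight).
    """
    faith = sum(faithfulness) / len(faithfulness)
    return weight * quality + (1 - weight) * faith

# A rephrasing that reads well but drifts from the source meaning scores
# lower than a slightly less polished one that stays faithful.
drifted = composite_reward(quality=0.9, faithfulness=[0.2, 0.3, 0.1])
faithful = composite_reward(quality=0.7, faithfulness=[0.9, 0.8, 0.9])
```

The key design point this sketch illustrates is that quality alone is not enough: without the faithfulness terms, the rephraser could maximize reward by producing fluent text that discards the original content.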

Why does it matter?

This research is important because it offers a more efficient way to train powerful language models. RePro allows researchers to get significantly better performance from the same amount of data, or even less, compared to traditional methods. This means we can continue to improve LLMs without constantly needing to find and curate enormous new datasets, saving time and resources and making advancements more sustainable.

Abstract

High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.