Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data
Syeda Nahida Akter, Shrimai Prabhumoye, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Yejin Choi, Bryan Catanzaro
2025-10-07

Summary
This paper investigates how and when to use data designed to improve a large language model's reasoning skills during its training process.
What's the problem?
Currently, most improvements to large language models' reasoning come from training them *after* they've been initially built, using specific reasoning examples. It's unclear whether it's better to include these reasoning examples earlier, during the initial building phase (pretraining), and whether doing so could actually hurt the model's ability to generalize to new situations. Also, because the initial training data for many of these models isn't public, it's hard to know how much reasoning data was used from the start.
What's the solution?
The researchers systematically tested different amounts, types, and qualities of reasoning data at various stages of training, both during the initial building (pretraining) and the later refinement (supervised fine-tuning, or SFT). They found that including reasoning data early, during pretraining, is crucial for building a strong reasoning foundation, and that this early exposure can't be fully replaced by simply adding more data later. They also discovered that pretraining benefits most from a *variety* of reasoning examples, while the later refinement stage is more sensitive to *high-quality* examples. Finally, they showed that good pretraining data has latent effects that only become visible after fine-tuning, and that naively scaling up the refinement data can wash away the benefits of early reasoning training. A sketch of this stage-wise allocation idea follows below.
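To make the data-allocation idea concrete, here is a minimal sketch of stage-wise mixture sampling, assuming a simple two-source corpus. The `Mixture` class, the stage weights, and the toy data pools are illustrative assumptions for this summary, not the authors' actual training recipe.

```python
# A hypothetical illustration of front-loading reasoning data: the weights and
# pool contents below are assumptions, not values reported in the paper.
from dataclasses import dataclass
import random

@dataclass
class Mixture:
    web_text: float   # general language-modeling data
    reasoning: float  # reasoning-intensive data (math, code, step-by-step solutions)

# Front-load reasoning into pretraining; keep SFT small but curated.
PRETRAIN = Mixture(web_text=0.75, reasoning=0.25)  # favor broad *diversity*
SFT = Mixture(web_text=0.0, reasoning=1.0)         # favor curated *quality*

def sample_example(mix, pools):
    """Draw one training example according to a stage's mixture weights."""
    source = random.choices(
        ["web_text", "reasoning"],
        weights=[mix.web_text, mix.reasoning],
    )[0]
    return random.choice(pools[source])

# Toy pools standing in for real corpora.
pools = {
    "web_text": ["a generic web document ..."],
    "reasoning": ["a step-by-step math solution ...", "an annotated code trace ..."],
}

print(sample_example(PRETRAIN, pools))  # usually web text, sometimes reasoning
print(sample_example(SFT, pools))       # always reasoning data
```

Under this framing, the paper's asymmetry would show up less in the weights than in how each pool is curated: the pretraining reasoning pool filtered for breadth of reasoning patterns, the SFT pool filtered for quality.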
Why it matters?
This research challenges the idea that language learning and reasoning are separate steps. It provides a guide for developers on how best to use data throughout the entire training process to create more capable language models, suggesting that a strategic approach to data allocation is key.
Abstract
The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is also increasingly incorporated during the mid-training stage (a practice that is relatively more proprietary and less openly characterized), its role in pretraining remains unclear. In particular, because the pretraining corpora of most frontier models are opaque, the effect of reasoning data introduced at different phases of pre- and/or post-training is underreported in the scientific literature. This raises several important questions: Is adding reasoning data earlier, during pretraining, any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% avg. gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% avg. gain), while SFT is more sensitive to data quality (15% avg. gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.