On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Charlie Zhang, Graham Neubig, Xiang Yue
2025-12-09
Summary
This paper investigates whether using reinforcement learning to fine-tune large language models actually *improves* their ability to reason, or whether the models are simply getting better at surfacing things they already learned during their initial training. It tries to pin down which part of the training process is most responsible for improvements in reasoning: the initial pre-training on a massive dataset, the mid-training adjustments that follow, or the final reinforcement learning step.
What's the problem?
It’s hard to know whether reinforcement learning is genuinely teaching language models to reason better. The pre-training corpora are enormous and largely opaque, the mid-training stage is rarely examined, and it’s difficult to pinpoint how reinforcement learning interacts with the knowledge the model already has. Basically, we can see *that* performance improves, but not *why* or *how*.
What's the solution?
The researchers built a fully controlled testing environment. They used synthetic reasoning problems composed of explicit atomic operations, with step-by-step solutions that can be parsed and checked, and they systematically manipulated the data the models were trained on. This let them isolate the effect of each training stage: the initial pre-training, the mid-training adjustments, and the final reinforcement learning. They then tested how well the models could handle longer, more complex compositions of those operations (extrapolative generalization) and the same problems presented in different surface contexts (contextual generalization), along the lines sketched below.
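The paper does not reproduce its task-generation code here, so the following is only a minimal, hypothetical sketch of what such a controlled setup could look like: a handful of made-up atomic operations (`add3`, `double`, `sub1`) are composed into problems of adjustable depth, and every intermediate step is emitted as a parseable trace. All names are illustrative assumptions, not the authors' actual task definitions.

```python
# Hypothetical sketch of a controlled synthetic reasoning task (not the
# authors' code): problems are compositions of explicit atomic operations,
# and every intermediate step is emitted so the trace can be checked.
import random

ATOMIC_OPS = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
    "sub1": lambda x: x - 1,
}

def make_example(depth, seed=None):
    """Compose `depth` atomic operations into one problem with a parseable trace."""
    rng = random.Random(seed)
    start = rng.randint(0, 9)
    ops = [rng.choice(list(ATOMIC_OPS)) for _ in range(depth)]

    value = start
    trace = [f"start = {start}"]
    for op in ops:
        value = ATOMIC_OPS[op](value)
        trace.append(f"{op} -> {value}")  # one parseable line per reasoning step

    return {
        "prompt": f"Starting from {start}, apply: " + ", then ".join(ops),
        "trace": trace,
        "answer": value,
        "depth": depth,  # e.g. train on small depths, test extrapolation on larger ones
    }

example = make_example(depth=3, seed=0)
```

In a setup like this, the `depth` knob stands in for compositional complexity (probing extrapolative generalization), while rewording the same operation sequence into different surface forms would probe contextual generalization.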
Why does it matter?
This research clarifies how best to train language models to reason effectively. It shows that reinforcement learning helps most when the model has a solid foundation from pre-training and when the RL data target challenging tasks that are just within the model’s reach. It also highlights the often-overlooked importance of the mid-training stage that sits between pre-training and RL, and shows that process-level rewards, which score the reasoning steps rather than only the final answer, lead to more reliable reasoning.
Abstract
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
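For reference, the pass@128 measure mentioned above is the probability that at least one of 128 sampled solutions is correct. The paper does not spell out its estimator here, so the snippet below is a generic sketch of the standard unbiased pass@k estimator from the code-generation evaluation literature, not the authors' implementation.

```python
# Generic unbiased pass@k estimator (Chen et al., 2021), shown as a reference
# for how a pass@128 score can be computed from n >= k samples per problem.
from math import comb

def pass_at_k(n, c, k):
    """n: samples drawn per problem, c: number correct, k: budget (e.g. 128)."""
    if n - c < k:  # fewer than k incorrect samples: any k samples include a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples with 3 correct solutions, evaluated at k = 128.
score = pass_at_k(n=200, c=3, k=128)
```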