
Reinforcement Mid-Training

Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi, Peng Han, Wei Wang

2025-10-09


Summary

This paper introduces a new stage in training large language models, called 'reinforcement mid-training,' that sits between the initial pre-training phase and the final post-training (polishing) phase. The researchers argue this middle stage can significantly improve how well these models perform.

What's the problem?

Current large language models are trained in two stages, pre-training and post-training, but this process isn't as efficient as it could be. The models sometimes 'overthink,' generating far more reasoning steps than a problem needs and wasting compute. They also treat every token as equally important, even though difficulty is highly imbalanced: some tokens are much harder to predict (higher entropy) than others, and the models don't account for that. Finally, they don't fully exploit the information each token carries during training.

What's the solution?

To fix these issues, the researchers developed a framework called RMT, with three main components. First, a dynamic token budget limits how much 'thinking' the model does on each example, preventing it from getting bogged down in unnecessary reasoning steps. Second, a curriculum-based adaptive sampling strategy starts with easier (lower-entropy) tokens and gradually moves to harder ones. Third, a dual training strategy combines reinforcement learning, targeted at the key tokens, with ordinary next-token prediction, so the model still learns from everything it sees.
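The first two components can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the per-token entropy measure matches the summary's description, but the linear curriculum schedule, the entropy cutoff, and the fixed (rather than dynamic) truncation budget are simplifying assumptions.

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution;
    # low entropy = "easy" token, high entropy = "hard" token.
    return -sum(p * math.log(p) for p in probs if p > 0)

def curriculum_threshold(step, total_steps, max_entropy):
    # Linearly raise the entropy cutoff over training so the model
    # sees easy tokens first, then harder ones (assumed schedule).
    return max_entropy * min(1.0, step / total_steps)

def select_tokens(token_dists, step, total_steps, max_entropy=2.0):
    # Keep only tokens whose predictive entropy is at or below the
    # current curriculum threshold.
    cutoff = curriculum_threshold(step, total_steps, max_entropy)
    return [i for i, dist in enumerate(token_dists)
            if token_entropy(dist) <= cutoff]

def apply_token_budget(reasoning_tokens, budget):
    # Truncate a reasoning trace to the token budget, curbing
    # overthinking; a real dynamic budget would adapt per example.
    return reasoning_tokens[:budget]
```

Early in training only near-deterministic tokens pass the filter; by the end, all tokens do, which is the easy-to-hard progression the summary describes.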

Why it matters?

This research is important because it shows a way to make large language models much more effective with less computational effort. The method achieved up to a 64.91% performance improvement in language modeling while using only 21% of the reasoning length. It also suggests that improving the model *during* training, not just before or after, pays off downstream: checkpoints from reinforcement mid-training improved subsequent post-training by up to 18.76% in the mathematical domain.

Abstract

The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
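The abstract's dual training strategy, reinforcement learning on key tokens plus next-token prediction on all tokens, can be sketched as a combined per-token loss. This is a hedged illustration under stated assumptions: the REINFORCE-style RL term, the additive weighting `ntp_weight`, and the binary `is_key_token` gate are assumptions for clarity, not the paper's exact objective.

```python
def ntp_loss(logprob_target):
    # Next-token-prediction term: negative log-likelihood of the
    # ground-truth token, applied to every token.
    return -logprob_target

def rl_loss(logprob_sampled, advantage):
    # REINFORCE-style term: reinforce sampled tokens in proportion
    # to their advantage (reward minus a baseline).
    return -advantage * logprob_sampled

def dual_loss(logprob_target, logprob_sampled, advantage,
              is_key_token, ntp_weight=1.0):
    # Combined objective: RL targets the key (e.g. high-entropy)
    # tokens, while next-token prediction covers all tokens so no
    # token information goes unused.
    loss = ntp_weight * ntp_loss(logprob_target)
    if is_key_token:
        loss += rl_loss(logprob_sampled, advantage)
    return loss
```

Non-key tokens contribute only the prediction term, while key tokens additionally receive the reward-driven RL signal, which matches the abstract's "targeted learning on key tokens and full exploitation of all token information."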