OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
2025-06-26
Summary
This paper introduces OctoThinker, an approach that makes language models better at learning through reinforcement by adding targeted "mid-training" — extra, carefully chosen training partway through the learning process — before the reinforcement learning stage begins.
What's the problem?
Language models often struggle to improve their reasoning skills through reinforcement learning, especially on difficult tasks like math, because they haven't been exposed to the right kind of training data before the reinforcement learning stage starts.
What's the solution?
The researchers tested different strategies for the middle phase of training and found that feeding models high-quality mathematical texts and well-formatted examples of step-by-step (chain-of-thought) reasoning makes the later reinforcement learning stage work much better. They built these findings into OctoThinker, which uses this mid-training recipe to boost the models' learning and reasoning abilities.
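The core idea — upweighting math text and step-by-step reasoning examples in the mid-training data mix — can be sketched as a simple data sampler. This is an illustrative toy, not the paper's actual pipeline; the function name, the data-source labels, and the mixture ratios are all assumptions for demonstration.

```python
import random

def sample_midtraining_batch(general_docs, math_docs, cot_examples,
                             math_frac=0.5, cot_frac=0.3, n=10, seed=0):
    """Sample a mid-training batch that upweights high-quality math text
    and chain-of-thought examples over general text.

    The fractions are hypothetical, chosen only to illustrate the idea
    of shifting the data mixture before reinforcement learning.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        r = rng.random()
        if r < math_frac:                      # upweighted math corpus
            batch.append(("math", rng.choice(math_docs)))
        elif r < math_frac + cot_frac:         # step-by-step reasoning examples
            batch.append(("cot", rng.choice(cot_examples)))
        else:                                  # remaining general text
            batch.append(("general", rng.choice(general_docs)))
    return batch
```

In practice the mixture would feed a continued-pretraining loop; the point of the sketch is only that mid-training deliberately changes *what* the model sees, rather than changing the training algorithm itself.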
Why it matters?
This matters because it helps make language models smarter and more reliable at complex thinking tasks, like solving problems or explaining answers, which is useful for education, research, and many AI applications.
Abstract
Investigating mid-training strategies reveals that high-quality mathematical corpora and well-formatted chain-of-thought reasoning examples enhance reinforcement learning performance in language models, leading to the development of OctoThinker.