rStar2-Agent: Agentic Reasoning Technical Report

Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang

2025-08-29

Summary

This paper introduces rStar2-Agent, a new artificial intelligence model designed to be very good at solving math problems. It's a 14-billion-parameter model, which makes it fairly large, and it uses a technique called 'agentic reinforcement learning' to achieve frontier-level performance.

What's the problem?

Current AI models struggle with complex math problems that require multiple steps and the use of tools like Python code. They often make mistakes in their reasoning or don't know when to check their work. Training these models is also very expensive, requiring a lot of computing power and time. The 'noise' from using coding tools during the learning process can also throw them off.

What's the solution?

The researchers developed rStar2-Agent using three main ideas. First, they built an efficient infrastructure that makes training fast even with limited computing resources. Second, they designed a new learning algorithm, GRPO-RoC, that helps the model cope with the noise that arises when using coding tools. Its 'Resample-on-Correct' strategy generates extra candidate solutions for each problem and preferentially keeps correct ones with clean tool usage for learning, so error-filled traces don't mislead the model. Finally, they used a smart training recipe that starts with basic skills and gradually builds up to more complex reasoning abilities. Starting from a pre-existing model, they achieved top performance in only 510 reinforcement learning steps, taking about a week.
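To make the Resample-on-Correct idea concrete, here is a minimal sketch of how such a rollout filter could work. All names (`resample_on_correct`, the `correct`/`tool_errors` fields, the balancing heuristic) are illustrative assumptions, not the paper's actual implementation:

```python
import random

def resample_on_correct(rollouts, group_size):
    """Downsample an oversampled rollout group (illustrative sketch).

    Each rollout is a dict with:
      'correct'     - did the final answer pass the checker?
      'tool_errors' - number of failed tool calls in the trace.
    Correct rollouts with the cleanest tool usage are kept
    preferentially; incorrect ones are sampled uniformly so the
    learning signal from failures is preserved.
    """
    positives = sorted((r for r in rollouts if r["correct"]),
                       key=lambda r: r["tool_errors"])
    negatives = [r for r in rollouts if not r["correct"]]

    # Aim for a roughly balanced group; fall back to what is available.
    n_pos = min(len(positives), group_size // 2)
    n_neg = min(len(negatives), group_size - n_pos)
    n_pos = min(len(positives), group_size - n_neg)

    return positives[:n_pos] + random.sample(negatives, n_neg)

# Oversample 16 rollouts, keep a training group of 8 (made-up data).
oversampled = (
    [{"correct": True,  "tool_errors": e} for e in (0, 0, 1, 2, 3, 5)] +
    [{"correct": False, "tool_errors": e} for e in range(10)]
)
group = resample_on_correct(oversampled, group_size=8)
```

The key design point the paper describes is filtering which rollouts the learner sees rather than changing the reward itself, so noisy code-execution feedback is handled at the data level.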

Why it matters?

rStar2-Agent is important because it significantly improves AI's ability to solve challenging math problems, even outperforming much larger models like DeepSeek-R1 (671 billion parameters). It also shows that these advanced capabilities can be achieved with relatively modest computing resources. Beyond math, the model also performs well in other areas like following instructions, scientific reasoning, and using different tools, suggesting a broader impact for this type of AI.

Abstract

We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multiple RL stages, yielding advanced cognitive abilities with minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to the state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
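For readers unfamiliar with the metric: "average pass@1" means each problem is attempted several times, the fraction of correct attempts is computed per problem, and these fractions are averaged across the benchmark. A minimal sketch (the attempt data and function name are illustrative):

```python
def avg_pass_at_1(results):
    """results: one list of booleans per problem, one entry per
    sampled attempt. pass@1 for a problem is the fraction of
    attempts that are correct; the benchmark score averages
    this fraction over all problems."""
    per_problem = [sum(attempts) / len(attempts) for attempts in results]
    return sum(per_problem) / len(per_problem)

# Two problems with four sampled attempts each (made-up data):
# 3/4 correct and 1/4 correct average to 0.5.
score = avg_pass_at_1([
    [True, True, False, True],
    [False, True, False, False],
])
```

Averaging over multiple samples per problem reduces the variance of the score on small benchmarks like AIME, which have only 30 problems per year.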