Demystifying Reinforcement Learning in Agentic Reasoning
Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang
2025-10-14
Summary
This paper investigates how to best use reinforcement learning (RL) to improve the reasoning abilities of large language models (LLMs) when those models act as agents, meaning they make decisions and use tools to achieve goals. It aims to identify the most effective ways to train these agentic LLMs.
What's the problem?
Large language models are getting better at many tasks, but when they need to *act* and use tools to solve complex problems, they often struggle. While training them with reinforcement learning seems promising, the best practices are not yet clear. Specifically, researchers didn't know what kind of data to use for training, which reinforcement learning algorithms work best, or how the LLM should 'think': should it plan extensively or just act quickly?
What's the solution?
The researchers systematically tested different approaches to training agentic LLMs. They found that initial training on real end-to-end examples of tool use, rather than artificially stitched-together ones, was much more effective. They also discovered that encouraging the LLM to explore during RL training, for example by raising the upper clipping bound on policy updates, shaping rewards for overlong responses, and maintaining adequate policy entropy, significantly improved performance. Finally, they found that a deliberative approach, where the LLM plans carefully and uses tools sparingly, worked better than constantly talking through its reasoning or calling tools excessively.
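The exploration technique the abstract calls "clip higher" amounts to asymmetric clipping of the policy-update ratio: the upper bound is loosened so that low-probability actions can grow faster during training. A minimal sketch in plain Python, assuming a token-level surrogate loss; the function name and the hyperparameter values are illustrative, not taken from the paper:

```python
import math

def clip_higher_pg_loss(logp_new, logp_old, advantages,
                        eps_low=0.2, eps_high=0.28):
    """Clipped policy-gradient surrogate with an asymmetric range.

    A larger upper bound (eps_high > eps_low) lets the probability of
    rarely-chosen tokens increase faster, which helps preserve policy
    entropy and exploration during agentic RL training.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new / pi_old for this token
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # Standard pessimistic min over the two surrogates, negated as a loss.
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With symmetric clipping, an update pushing a token's ratio to 2.0 would be capped at 1.2; here the positive-advantage side is capped at 1.28 instead, a deliberately exploration-friendly choice.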
Why it matters?
This work provides a practical guide for anyone trying to build agentic LLMs. It shows that you don't necessarily need huge models to get good results; by following these simple training techniques, even smaller models can perform as well as or better than much larger ones. The researchers also released the datasets they used, which will help other researchers build on this work and further improve agentic reasoning in LLMs.
Abstract
Recently, the emergence of agentic RL has shown that RL can effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization, and high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy, are crucial for agentic RL and improve training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving both tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks: AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models can achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL
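The "overlong reward shaping" mentioned in the abstract refers to softly penalizing responses that run past a length budget instead of applying a hard cutoff, so long-but-useful trajectories are not discarded outright. A hedged sketch of one common formulation; the function name and threshold values are illustrative assumptions, not taken from the paper:

```python
def overlong_penalty(length, max_len=4096, buffer=512):
    """Soft length penalty added to the task reward.

    Zero while the response stays inside the budget, then a linear
    ramp down to -1 across the final `buffer` tokens, capped at -1
    once `max_len` is reached. Thresholds here are illustrative.
    """
    if length <= max_len - buffer:
        return 0.0       # within budget: no penalty
    if length >= max_len:
        return -1.0      # fully overlong: maximum penalty
    # Linear ramp inside the buffer zone.
    return (max_len - buffer - length) / buffer
```

Compared with truncating and zeroing the reward, this gradual penalty gives the policy a smoother signal about response length, which is the exploration-friendly behavior the paper's recipe aims for.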