Agentic Reinforced Policy Optimization
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
2025-07-29
Summary
Agentic Reinforced Policy Optimization (ARPO) is a new method designed to help large language models (LLMs) reason through complex, multi-step problems, especially ones that require external tools. It improves how these models plan and act over long sequences by balancing deep, long-horizon reasoning with multi-turn tool interaction.
What's the problem?
Current reinforcement learning algorithms have trouble balancing two important skills in language models: the ability to think deeply over many steps (long-horizon reasoning) and the ability to interact with external tools multiple times (multi-turn tool interactions). This makes it hard for the models to perform well in realistic situations where both skills are needed.
What's the solution?
ARPO addresses this by monitoring the model's uncertainty during generation. After each tool call, if the model's token-level entropy rises (a sign of uncertainty), ARPO adaptively branches additional rollouts at that step, encouraging the model to explore more candidate continuations where exploration is most useful. It also uses advantage attribution to estimate how much each individual tool-use step contributes to the final outcome, so credit is assigned at the step level rather than only to the whole trajectory. Together, these mechanisms guide the model to improve its reasoning and tool use step by step across multiple turns.
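The entropy-driven branching idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `threshold` and `extra_branches` are hypothetical parameters, and in practice the entropy would come from the policy model's next-token logits after a tool call rather than from hand-written distributions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_branching(step_probs, threshold=1.0, extra_branches=3):
    """Decide, per post-tool-call step, how many rollout branches to spawn.

    step_probs: one next-token probability distribution per step taken
    right after a tool call. When entropy exceeds `threshold`, the step
    is considered uncertain and `extra_branches` additional rollouts are
    spawned there; otherwise a single trajectory continues.
    """
    plan = []
    for probs in step_probs:
        h = token_entropy(probs)
        branches = 1 + (extra_branches if h > threshold else 0)
        plan.append((round(h, 3), branches))
    return plan

# A confident step (low entropy) vs. an uncertain step (high entropy):
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(adaptive_branching([confident, uncertain]))
# → [(0.168, 1), (1.386, 4)]
```

The intuition is that sampling extra branches only where entropy spikes concentrates the exploration budget on the decisions that matter, instead of spending it uniformly across every step of every rollout.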
Why it matters?
This method matters because it makes large language models better at solving real-world problems that require many reasoning steps and repeated tool use, while consuming fewer training resources. ARPO also works well across different model sizes, making it a practical way to train intelligent agents that can adapt in dynamic environments.
Abstract
Agentic Reinforced Policy Optimization (ARPO) enhances multi-turn reasoning in large language models by balancing long-horizon capabilities and tool interactions, using entropy-based adaptive rollouts and advantage attribution.