Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying
2025-10-17
Summary
This paper focuses on improving how AI agents, powered by large language models, learn to use tools and find information over multiple steps, like when searching the internet to answer a complex question.
What's the problem?
When training these agents with reinforcement learning, it's hard to give them useful feedback. Traditional methods only reward the agent when it finally produces the right answer, which is too late for it to learn from mistakes made along the way. This is especially bad for long, multi-step tasks: when every attempt ends with the same reward, they all look equally good or bad, and it becomes very difficult to figure out which individual steps were helpful and which weren't.
What's the solution?
The researchers developed a new training method called Information Gain-based Policy Optimization, or IGPO. Instead of rewarding only the final answer, IGPO gives the agent a small reward after *each* step it takes, based on how much closer that step brings it to the correct answer. It measures this by tracking how the model's own confidence in the correct answer changes after each action, so it needs no separate reward model or expensive extra computation. The result is more frequent and more informative feedback.
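In code, the core idea can be sketched as follows. Suppose we can read off the policy's probability of the ground-truth answer after each turn (passed in here as `belief_probs`, with the belief before any interaction first); each turn's reward is the marginal increase in that probability, and the final-answer (outcome) reward is added at the last turn. This is a minimal illustration of the idea, not the paper's exact implementation:

```python
def igpo_rewards(belief_probs, outcome_reward):
    """Dense reward trajectory in the spirit of IGPO.

    belief_probs: policy's probability of the correct answer before any
        interaction, then after each of the T turns (length T + 1).
    outcome_reward: final-answer reward, added at the last turn.
    Returns one reward per turn (length T).
    """
    # Turn t's reward is the information gain: p_t - p_{t-1}.
    turn_rewards = [belief_probs[t] - belief_probs[t - 1]
                    for t in range(1, len(belief_probs))]
    # Combine turn-level rewards with outcome-level supervision.
    turn_rewards[-1] += outcome_reward
    return turn_rewards
```

For example, a trajectory whose answer probability moves 0.1 → 0.4 → 0.35 → 0.9 yields turn rewards of roughly +0.30, -0.05, and +0.55 plus the outcome reward, so an unhelpful middle turn is penalized even when the episode succeeds.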
Why it matters?
This research is important because it makes it easier to train AI agents to perform complex tasks that require multiple steps and searching for information. By providing more detailed feedback during the learning process, the agents learn faster and become more accurate, even when faced with new and unfamiliar situations.
Abstract
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.