
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li

2025-12-22


Summary

This paper investigates how to best train large language models to act as agents that can interact with environments over multiple steps, like completing tasks in a simulated store or solving puzzles.

What's the problem?

Current methods for training these agents, specifically one called Group Relative Policy Optimization (GRPO), struggle when the task requires planning and remembering information over many turns. Because GRPO assigns one trajectory-level reward signal to an entire rollout, it is hard for the agent to learn which individual decisions helped or hurt when it needs to think long-term and act on past interactions.
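To make the credit-assignment issue concrete, here is a minimal sketch of the group-relative advantage at the heart of GRPO. The function name and values are illustrative, not from the paper's code; the key point is that every token in a rollout shares one normalized scalar.

```python
# Hypothetical sketch of GRPO-style advantage estimation (names and
# rewards are illustrative, not from the paper's implementation).
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalize each rollout's scalar reward against its group.

    GRPO samples a group of rollouts for the same prompt and uses the
    group-normalized reward as the advantage for every token in that
    rollout -- a single trajectory-level signal, which is what makes
    credit assignment coarse over long multi-turn horizons.
    """
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four rollouts of the same task; only their final rewards matter.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.0])
```

Note that every turn inside a single rollout receives the same advantage, regardless of which turn actually earned the reward.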

What's the solution?

The researchers found that a different training method, Proximal Policy Optimization (PPO), worked more reliably. They then improved on it with a new variant called turn-PPO. Instead of treating each individual token as a separate decision step, turn-PPO estimates advantages at the level of each complete 'turn' in the interaction, making it easier for the agent to connect rewards to whole decisions and plan ahead. They tested this on an online-shopping benchmark (WebShop) and a block-pushing puzzle (Sokoban).
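The turn-level idea can be sketched with standard Generalized Advantage Estimation applied over turns rather than tokens. This is a hedged illustration under the assumption of a critic that outputs one value per turn; the function and its inputs are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of turn-level advantage estimation: standard GAE
# where each index is one complete agent turn (an action plus the
# environment's feedback), not one token.
def turn_level_gae(turn_rewards, turn_values, gamma=1.0, lam=0.95):
    """Compute one advantage per turn via GAE.

    Credit is assigned per decision (turn), so a reward earned late in
    an episode propagates back to the specific turns that led to it,
    instead of being smeared uniformly over every token.
    """
    advantages = [0.0] * len(turn_rewards)
    gae = 0.0
    for t in reversed(range(len(turn_rewards))):
        # Value of the next turn's state; 0 after the final turn.
        next_value = turn_values[t + 1] if t + 1 < len(turn_values) else 0.0
        delta = turn_rewards[t] + gamma * next_value - turn_values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Every token generated within turn t would then share advantages[t]
# in the PPO clipped-objective update.
```

The design choice here is granularity: the token-level MDP forces the critic to value every token position, while the turn-level MDP treats one full action as the atomic step, which matches how multi-turn agentic tasks are actually structured.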

Why it matters?

This work is important because it helps us build more capable AI agents that can handle complex, real-world tasks that require reasoning and planning over extended periods. By making these agents more stable and effective, we can move closer to AI systems that can truly assist us in a variety of situations.

Abstract

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.