
A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

Ruiyi Wang, Prithviraj Ammanabrolu

2025-10-06


Summary

This paper investigates how to best train large language models, like ChatGPT, to act as 'agents' that can complete tasks over multiple steps using reinforcement learning. It's about figuring out what actually makes these AI agents successful, rather than just assuming what should work.

What's the problem?

Currently, there's a lot of confusion and inconsistency in how researchers are trying to train these AI agents. Different approaches are used without a clear understanding of *why* some work and others don't. There's no organized way to know which design choices are most important for building effective agents, especially when dealing with tasks that require reasoning and interacting with an environment.

What's the solution?

The researchers broke the training process down into three key areas: the environment the agent operates in, the reward signal that tells the agent how well it is doing, and the agent's 'policy', which is how it decides what to do next. They then systematically tested different setups within each area across several simulated environments (TextWorld, ALFWorld, and SWE-Gym), which range from following instructions in a text-based game to writing computer code. They varied how complex the tasks were, how frequently the agent received rewards, and which learning algorithms the agent used to turn those rewards into better behavior. Ultimately, they distilled the findings into a set of guidelines, or 'recipe', for training these agents effectively; a simplified sketch of the moving parts appears below.
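To make the three pillars concrete, here is a minimal, hypothetical sketch of a multi-turn rollout loop in Python. It is not the paper's code: the toy environment, the dense-versus-sparse reward switch, and the placeholder policy are illustrative stand-ins for the kinds of components the authors vary (TextWorld/ALFWorld-style environments, turn-level versus episode-level rewards, and an LLM-based policy).

```python
# Illustrative sketch only -- all names and numbers here are invented, not the paper's code.
from dataclasses import dataclass
import random


@dataclass
class TextEnv:
    """Toy stand-in for a situated text environment (pillar 1: environment)."""
    optimal_len: int = 5          # task complexity, proxied by optimal solution length
    dense_rewards: bool = True    # pillar 2: turn-level (dense) vs episode-level (sparse) reward
    _steps: int = 0

    def reset(self) -> str:
        self._steps = 0
        return "You are in a kitchen. Goal: put the apple in the fridge."

    def step(self, action: str) -> tuple[str, float, bool]:
        self._steps += 1
        done = self._steps >= self.optimal_len
        if self.dense_rewards:
            reward = 1.0 / self.optimal_len      # shaped signal on every turn
        else:
            reward = 1.0 if done else 0.0        # single success signal at episode end
        return f"Observation after '{action}'.", reward, done


def policy(history: list[str]) -> str:
    # Pillar 3: the policy. A real agent would sample the next action from an LLM
    # conditioned on the full interaction history; here we pick a placeholder action.
    return random.choice(["open fridge", "take apple", "go to fridge", "put apple in fridge"])


def rollout(env: TextEnv) -> list[tuple[str, str, float]]:
    """Collect one multi-turn trajectory of (observation, action, reward) tuples."""
    obs, done, traj = env.reset(), False, []
    while not done:
        action = policy([obs] + [a for _, a, _ in traj])
        next_obs, reward, done = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
    return traj


if __name__ == "__main__":
    for dense in (True, False):
        traj = rollout(TextEnv(dense_rewards=dense))
        print(f"dense={dense}: per-turn rewards = {[r for _, _, r in traj]}")
```

Running the script with the reward switch on and off shows the practical difference the paper ablates: a shaped signal on every turn versus a single success signal at the end of the episode.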

Why it matters?

This research is important because it provides a much-needed, clear framework for building AI agents that can reliably complete complex tasks. By identifying what works and what doesn't, it helps researchers and developers avoid wasting time on ineffective approaches and focus on the most promising strategies. This will accelerate progress in creating more capable and useful AI systems that can assist us in a variety of real-world applications.

Abstract

We study what actually works and what doesn't for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter-related pillars -- environment, reward, and policy -- and empirically derive a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software engineering style tasks. (i) For the environment, we analyze the impacts of task complexity in terms of sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability are highly dependent on the choice of RL algorithm. (iii) And for the agent's policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods in addition to showing how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
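As a rough illustration of the policy pillar, the snippet below contrasts the two advantage-estimation styles named in the abstract. It is a simplified sketch, not the authors' implementation: GRPO-style group normalization divides centered returns by the group's standard deviation (one source of the bias the abstract refers to), while RLOO subtracts a leave-one-out mean baseline, which keeps the gradient estimate unbiased. The list of returns stands in for several rollouts sampled on the same task.

```python
# Simplified sketch of the advantage estimators behind the biased (GRPO) vs
# unbiased (RLOO) distinction in the abstract -- not the paper's implementation.
from statistics import mean, stdev


def grpo_advantages(returns: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style: normalize each return by the group mean and standard deviation."""
    mu, sigma = mean(returns), stdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def rloo_advantages(returns: list[float]) -> list[float]:
    """RLOO: baseline for sample i is the mean of the other n-1 samples in the group."""
    n = len(returns)
    return [r - (sum(returns) - r) / (n - 1) for r in returns]


if __name__ == "__main__":
    group = [0.0, 0.0, 1.0, 1.0]  # e.g. binary task success over 4 rollouts
    print("GRPO:", grpo_advantages(group))
    print("RLOO:", rloo_advantages(group))
```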