
Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen

2025-10-15


Summary

This paper explores how to train AI agents, powered by large language models, to use tools across many turns of a long-running task, even though these models can only hold a limited amount of information in their context at once.

What's the problem?

When you're teaching an AI to carry out a series of tasks using tools, it needs to remember what it has already done in order to make good decisions. However, large language models have a limited 'memory': they can only consider a fixed amount of text at a time. This becomes a serious problem when a task requires many steps, because the AI quickly runs out of space to keep track of everything. As a result, it forgets its instructions, burns through computing power on ever-longer contexts, and ultimately fails to complete the task.

What's the solution?

The researchers came up with a way to help the AI 'summarize' its past actions and the information it has gathered while using tools. Essentially, the AI periodically writes a short summary of what has happened so far and uses that summary in place of the full detailed history. This keeps the amount of information the AI needs to process manageable, allowing it to work on much longer and more complex tasks. They also developed a training method called SUPO (SUmmarization augmented Policy Optimization) that optimizes both how the AI uses tools *and* how it writes these summaries, all at the same time.
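The core mechanism described above, replacing an overgrown tool-use history with a compact summary, can be sketched in a few lines. This is an illustrative toy, not the paper's actual implementation: `count_tokens`, `manage_context`, and `run_agent` are hypothetical names, and a real system would use a proper tokenizer and an LLM call where the `summarize` callable appears.

```python
def count_tokens(lines):
    # Crude proxy for context length: whitespace-delimited word count.
    # A real agent would use the model's tokenizer here.
    return sum(len(line.split()) for line in lines)

def manage_context(context, summarize, max_tokens=50):
    """If the working context exceeds the budget, replace everything
    after the task line with a summary. In practice `summarize` is an
    LLM call that keeps task-relevant facts; here it is any callable."""
    if count_tokens(context) > max_tokens:
        summary = summarize("\n".join(context[1:]))
        context = [context[0], f"Summary of earlier steps: {summary}"]
    return context

def run_agent(task, policy, summarize, max_rounds=8, max_tokens=50):
    """Sketch of a multi-turn tool-use loop with periodic summarization.
    `policy` maps the current context string to the next action/observation
    string (standing in for the LLM agent plus tool execution)."""
    context = [f"Task: {task}"]
    for _ in range(max_rounds):
        context = manage_context(context, summarize, max_tokens)
        context.append(policy("\n".join(context)))
    return context
```

Because the history is compressed whenever it crosses the budget, the working context stays bounded no matter how many rounds the agent runs, which is what lets training scale beyond the model's fixed context window.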

Why it matters?

This research is important because it allows us to build AI agents that can handle more complex, real-world problems that require many steps and a lot of information. By overcoming the 'memory' limitations of large language models, we can create AI that is more reliable, efficient, and capable of tackling tasks that were previously impossible.

Abstract

We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, SUPO can further improve evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used during training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
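The abstract mentions a policy gradient representation but does not spell it out; the paper's exact derivation is not reproduced here. As a hedged sketch, a generic REINFORCE-style objective that credits every token the policy generated, both tool-call tokens and summary tokens, with the trajectory's return would take the form

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \sum_{k=1}^{K} \sum_{t=1}^{T_k} \nabla_\theta \log \pi_\theta\!\left(a_{k,t} \mid s_{k,t}\right) \, A(\tau) \right],
\]

where \(k\) indexes the context segments between summarization events, \(a_{k,t}\) ranges over both tool-use and summary tokens within segment \(k\), \(s_{k,t}\) is the (summarized) working context at that point, and \(A(\tau)\) is a trajectory-level advantage. The key property this notation illustrates is that summary tokens receive gradient signal alongside action tokens, so the summarization strategy is trained end-to-end rather than fixed by a heuristic.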