SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An
2025-09-03
Summary
This paper focuses on improving how large language models (LLMs) solve complex problems by using external tools, such as calculators or code interpreters. It specifically tackles the difficulty of training these models to use tools effectively over multiple turns of a problem-solving process.
What's the problem?
When an LLM is trained with reinforcement learning to use tools repeatedly, as in a multi-step math problem, training often becomes unstable and the model's performance actually gets worse. This happens because tool outputs fed back into the model drift away from the text it was trained on, pushing it to generate low-probability, often nonsensical tokens. The effect compounds over successive turns, and these bad responses produce huge gradient spikes during training, essentially breaking the learning process.
What's the solution?
The researchers developed a method called SimpleTIR to fix this. It works by identifying trajectories that contain 'void turns' and removing them from the training data. A 'void turn' is a turn in which the model produces neither a code block (a tool call) nor a final answer, so the step makes no progress toward solving the problem. By discarding trajectories with these unproductive turns before the policy update, SimpleTIR blocks the large, harmful gradients they cause and keeps the learning process stable.
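The filtering idea above can be sketched in a few lines. This is a minimal, illustrative Python sketch, not the paper's actual implementation: the function names and the markers used to detect a code block (a ``` fence) and a final answer (a `\boxed{}` expression) are assumptions chosen for the example.

```python
import re

def is_void_turn(turn_text: str) -> bool:
    """A turn is 'void' if it yields neither a code block nor a final answer.

    The detection markers here are assumptions for illustration: a fenced
    code block signals a tool call, and a \\boxed{} expression signals a
    final answer.
    """
    has_code = "```" in turn_text
    has_answer = re.search(r"\\boxed\{", turn_text) is not None
    return not (has_code or has_answer)

def filter_trajectories(trajectories):
    """Keep only trajectories with no void turns for the policy update."""
    return [traj for traj in trajectories
            if not any(is_void_turn(turn) for turn in traj)]

# Example: the second trajectory has a void middle turn and is dropped.
good = ["Compute:\n```python\nprint(2 + 2)\n```", "The answer is \\boxed{4}."]
bad = ["Compute:\n```python\nprint(2 + 2)\n```",
       "Hmm, let me think some more...",   # neither code nor answer: void
       "The answer is \\boxed{4}."]
kept = filter_trajectories([good, bad])
print(len(kept))  # → 1
```

Because the filter operates only on completed trajectories, it can be dropped in front of any policy-gradient update without modifying the RL algorithm itself, which is what makes the approach "plug-and-play."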
Why it matters?
This research matters because it lets LLMs get much better at complex, multi-step reasoning. The method significantly improved performance on challenging math benchmarks, and, because it avoids heavy supervised fine-tuning, the model learns problem-solving strategies on its own, such as checking its own work or verifying answers through multiple approaches, without extensive pre-programmed guidance.
Abstract
Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.