PORTool: Tool-Use LLM Training with Rewarded Tree
Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rodin Luo, Jing Gao
2025-10-31
Summary
This paper introduces a new method, PORTool, to improve how large language models use tools to answer questions. Current models are good at using tools, but they tend to follow a single, predictable path to a solution, even if better options exist.
What's the problem?
Existing tool-use large language models are trained on static examples of how to solve problems. This makes them good at *imitating* how tools are used, but they don't really *explore* different ways to use those tools. Imagine you're trying to find a restaurant: the model might always follow the same search steps, even if a different approach would be faster or more accurate. This is a problem because the real world is dynamic, and the best way to use tools can change over time.
What's the solution?
PORTool uses a technique called reinforcement learning to encourage the model to try out different sequences of tool use. It's like giving the model 'rewards' for finding the right answer and for successfully using tools. The key is that PORTool doesn't just reward the final answer; it rewards each *step* along the way, and importantly, it recognizes when different paths share initial steps. This allows the model to learn which tool-use strategies are generally good, and which are best for specific situations, leading to a more flexible and effective approach.
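The step-wise reward idea above can be sketched in a few lines. This is a minimal illustration (not the paper's implementation): trajectories that share a prefix of tool calls collapse into the same tree node, and each node's reward blends final-answer correctness with tool-call success. The function name, the 0.5/0.5 weighting, and the averaging over trajectories through a shared node are illustrative assumptions.

```python
from collections import defaultdict

def step_rewards(trajectories, outcomes, success_flags):
    """Assign one reward per unique tool-call step (tree node).

    trajectories: list of tool-call sequences, e.g. [["search", "rank"], ...]
    outcomes: per trajectory, 1.0 if it produced the correct answer, else 0.0
    success_flags: per trajectory, 1.0/0.0 for each tool call's success
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for traj, outcome, succ in zip(trajectories, outcomes, success_flags):
        for i in range(len(traj)):
            # A node is identified by its prefix, so a step shared across
            # trajectories is one node and receives a single (averaged) reward.
            node = tuple(traj[: i + 1])
            # Illustrative reward: blend answer correctness and call success.
            totals[node] += 0.5 * outcome + 0.5 * succ[i]
            counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}
```

With two rollouts that share a first step and then fork, the shared step gets one reward while the steps under the fork get different rewards, which is exactly the structure the method exploits.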
Why it matters?
This research is important because it makes tool-using language models more adaptable and accurate. By encouraging exploration, PORTool helps these models perform better in real-world scenarios where information changes and the best solution isn't always obvious. This could lead to more helpful and reliable AI assistants that can handle complex tasks involving various tools.
Abstract
Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
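The abstract's blending of fork-relative and trajectory-relative advantages can be sketched as follows. This is a hypothetical sketch, not the paper's formula: it assumes each advantage is a z-score (reward minus group mean, divided by group standard deviation), with the fork-relative term computed against sibling steps under the same fork and the trajectory-relative term against the other rollouts, then mixed with an assumed coefficient `alpha`.

```python
import statistics

def blended_advantage(step_reward, sibling_rewards,
                      traj_return, group_returns, alpha=0.5):
    """Blend a fork-relative advantage (step vs. its siblings under the same
    fork) with a trajectory-relative advantage (rollout vs. the group).
    The z-score normalization and alpha mixing are illustrative assumptions."""
    def normalize(x, group):
        mean = statistics.fmean(group)
        std = statistics.pstdev(group)
        return (x - mean) / std if std > 0 else 0.0

    fork_adv = normalize(step_reward, sibling_rewards)
    traj_adv = normalize(traj_return, group_returns)
    return alpha * fork_adv + (1 - alpha) * traj_adv
```

Intuitively, the fork-relative term credits choosing the better branch at a fork, while the trajectory-relative term credits rollouts that end well overall; blending the two gives each step a training signal even when whole trajectories tie.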