
ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, Ruocheng Guo

2026-01-21


Summary

This paper introduces a new way to test how well the computer programs that hand out rewards can judge AI agents that use tools, like a calculator or a web browser, at each step of a task.

What's the problem?

AI agents are getting better at using tools, and a key part of this is giving them rewards when they do things correctly at each step. However, there wasn't a good, standardized way to actually *test* if the system giving those rewards – called a process reward model – was working well, especially when the agent is trying to complete a complex task with multiple steps and tools.
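To make the "reward at each step" idea concrete, here is a minimal sketch of the interface a process reward model exposes. The function and tool names are illustrative assumptions, not the paper's code; a real PRM is a trained neural scorer, whereas this stand-in is a toy heuristic.

```python
# Toy sketch of the step-level reward contract a PRM provides.

def process_reward(history: list[str], action: str) -> float:
    """Return a scalar reward for one proposed step, given the context.

    A real PRM is a trained model; this keyword check is illustrative only.
    """
    known_tools = {"calculator(", "web_search("}
    return 1.0 if any(t in action for t in known_tools) else 0.0


# Reward-guided search uses step scores to rank candidate next actions:
history = ["user: what is 37 * 91?"]
candidates = ["calculator(37 * 91)", "answer directly: about 3000"]
best = max(candidates, key=lambda a: process_reward(history, a))
print(best)  # -> "calculator(37 * 91)"
```

The key point is that the score is attached to a single step given the interaction so far, not to the finished task, which is what lets search methods prune bad actions early.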

What's the solution?

The researchers created a benchmark called ToolPRMBench. They took existing tasks where AI agents use tools and broke them down into individual steps. For each step, they recorded the interaction history, the correct action, a plausible but incorrect alternative, and information about the tool being used. They then used this data to test different reward systems, both general ones and ones specifically designed for tool use, by checking whether each reward system scored the right action higher than the wrong one. They also used multiple AI models to double-check the accuracy of their test data, as sketched after the abstract below.
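A rough sketch of what one such test case and the scoring check might look like in code; the field names and evaluation interface are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToolPRMCase:
    """One step-level test case (field names are assumptions, not the paper's schema)."""
    history: list[str]       # interaction so far
    correct_action: str      # the verified right next step
    incorrect_action: str    # a plausible but wrong alternative
    tool_metadata: dict = field(default_factory=dict)  # e.g. tool name, argument schema

def pairwise_accuracy(prm_score, cases: list[ToolPRMCase]) -> float:
    """Fraction of cases where the PRM scores the correct action above the incorrect one."""
    hits = sum(
        prm_score(c.history, c.correct_action) > prm_score(c.history, c.incorrect_action)
        for c in cases
    )
    return hits / len(cases)

# Usage (with any step scorer, such as the toy process_reward above):
# accuracy = pairwise_accuracy(process_reward, benchmark_cases)
```

Framing each case as a correct/incorrect pair turns PRM evaluation into a simple comparison: a good reward model should consistently prefer the verified action.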

Why it matters?

This work is important because it provides a reliable way to measure and improve the 'reward' systems used in AI agents that use tools. By having a good benchmark, researchers can develop better reward models, which will ultimately lead to AI agents that are more effective and reliable when using tools to solve problems.

Abstract

Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design choice, these search methods use process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool-using settings. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We use offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. We propose a multi-LLM verification pipeline to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool use. Code and data will be released at https://github.com/David-Li0406/ToolPRMBench.
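The abstract mentions a multi-LLM verification pipeline but does not spell out its mechanics here. One plausible reading is a majority-vote filter over independent LLM judges, sketched below; the vote rule, names, and interface are all assumptions, and the paper's actual pipeline may differ.

```python
# Guess at the spirit of multi-LLM verification: keep a candidate test
# case only if a majority of independent LLM judges agree with its
# correct/incorrect labels. Judges here are toy callables.

def passes_verification(case: dict, judges: list) -> bool:
    votes = [judge(case) for judge in judges]  # each judge returns True/False
    return sum(votes) > len(votes) / 2         # strict majority

# Toy demonstration with three stand-in judges:
judges = [lambda c: True, lambda c: True, lambda c: False]
candidates = [{"id": 1}, {"id": 2}]
verified = [c for c in candidates if passes_verification(c, judges)]
print(len(verified))  # -> 2 (each case passes the 2-of-3 majority)
```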