ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
2025-09-16
Summary
This paper focuses on how to train the programs that judge whether a large language model (LLM) is *effectively* using tools, like calculators or search engines, to solve problems.
What's the problem?
Currently, the systems used to judge how well an LLM is doing – called reward models – are really good at evaluating natural language, like essays or stories. However, they aren't very good at understanding if an LLM is correctly *using* a tool to get to the right answer. They might miss important clues that show the LLM is reasoning well with the tool, or that the tool is being used appropriately. This makes it hard to improve LLMs that rely on tools.
What's the solution?
The researchers created a new set of tests, called FC-RewardBench, specifically designed to measure how well reward models understand tool use. They then developed a new way to train reward models, using data generated by open-weight LLMs, that focuses on the *outcome* of using the tool – did it actually help solve the problem? They trained several such models, ranging from 1.7 to 14 billion parameters, and found them significantly better at judging tool use than existing general-purpose reward models, improving downstream task performance by up to 25% on average across seven benchmarks.
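One practical use of such an outcome reward model, highlighted in the paper, is filtering synthesized tool-calling data before fine-tuning. Below is a minimal sketch of that idea; the `Trajectory` class, `filter_by_reward` function, and the toy scores are hypothetical placeholders, not the paper's actual code or API.

```python
# Sketch of reward-guided data filtering for data-efficient fine-tuning.
# All names and scores here are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Trajectory:
    user_request: str
    tool_call: dict   # e.g. {"name": "get_weather", "arguments": {...}}
    reward: float     # score assigned by an outcome reward model, in [0, 1]

def filter_by_reward(trajectories, threshold=0.8):
    """Keep only trajectories the reward model judges as successful tool use."""
    return [t for t in trajectories if t.reward >= threshold]

# Toy synthesized data: two candidate tool calls for the same request,
# scored by a (hypothetical) outcome reward model.
candidates = [
    Trajectory("What's the weather in Paris?",
               {"name": "get_weather", "arguments": {"city": "Paris"}}, 0.93),
    Trajectory("What's the weather in Paris?",
               {"name": "get_weather", "arguments": {"city": "France"}}, 0.21),
]

kept = filter_by_reward(candidates)
print(f"Kept {len(kept)} of {len(candidates)} trajectories for fine-tuning.")
```

Only the high-reward trajectory survives, so the fine-tuning set stays small but clean, which is where the data-efficiency claim comes from.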
Why it matters?
This work is important because as LLMs become more capable of using tools to accomplish tasks, we need better ways to evaluate and improve their performance. Better reward models mean we can more effectively train LLMs to use tools correctly, leading to more reliable and helpful AI systems. This also allows for more efficient fine-tuning, meaning less data is needed to get good results.
Abstract
As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.
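To make the benchmarking idea concrete, a benchmark like FC-RewardBench presumably asks whether a reward model scores a correct tool call above an incorrect one for the same request. The sketch below shows that kind of pairwise accuracy computation; the scores are made-up placeholders standing in for a real reward model's outputs, and the benchmark's actual format may differ.

```python
# Sketch of pairwise evaluation for a tool-calling reward model:
# the model is credited when it scores the correct call above the incorrect one.

def pairwise_accuracy(pairs):
    """pairs: list of (score_for_correct_call, score_for_incorrect_call)."""
    wins = sum(1 for good, bad in pairs if good > bad)
    return wins / len(pairs)

# Hypothetical reward-model scores on three tool-calling examples.
scored_pairs = [
    (0.91, 0.12),  # correct call clearly preferred
    (0.55, 0.60),  # reward model mistakenly prefers the wrong call
    (0.78, 0.30),  # correct call preferred
]
print(f"Pairwise accuracy: {pairwise_accuracy(scored_pairs):.2f}")  # -> 0.67
```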