AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy
2025-04-15
Summary
This paper introduces AgentRewardBench, a new benchmark designed to test how well large language models (LLMs) can automatically judge the performance of web agents: software that completes tasks online, such as shopping or posting on forums.
What's the problem?
Current automatic evaluation methods, especially rule-based systems, often fail to measure accurately whether a web agent has really succeeded at its task. These rule-based methods can miss important details and tend to underreport how well agents are actually doing, so researchers and developers may not get a true sense of an agent's abilities.
What's the solution?
The researchers created AgentRewardBench, a collection of over a thousand examples of web agents performing different tasks. Each example is carefully reviewed by experts to establish a gold standard for what success looks like. They then use this benchmark to compare how well different LLMs and rule-based methods judge agent performance, revealing where the automatic systems fall short and which LLMs are more reliable for particular types of tasks, as the sketch below illustrates.
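To make the comparison concrete, here is a minimal Python sketch, under assumed stand-ins, of the core measurement: an automatic judge is scored by how often it agrees with expert labels. The names Trajectory, judge_accuracy, and dummy_rule_judge are hypothetical illustrations, not the paper's released code or data format.

```python
# Minimal sketch (not the paper's released code) of scoring an
# automatic judge against expert annotations of agent trajectories.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    task: str             # natural-language goal, e.g. "buy a red mug"
    steps: list[str]      # the agent's recorded actions
    expert_success: bool  # gold label assigned by a human expert

# Any automatic judge, LLM-based or rule-based, maps a trajectory
# to a predicted success/failure verdict.
Judge = Callable[[Trajectory], bool]

def judge_accuracy(judge: Judge, trajectories: list[Trajectory]) -> float:
    """Fraction of trajectories where the automatic judge agrees with
    the expert label -- the core quantity such a benchmark measures."""
    hits = sum(judge(t) == t.expert_success for t in trajectories)
    return hits / len(trajectories)

# Illustrative rule-based judge: declares success only if the agent's
# final step contains a fixed completion marker. Judges like this miss
# successes on tasks without such markers.
def dummy_rule_judge(t: Trajectory) -> bool:
    return "order confirmed" in t.steps[-1].lower()

if __name__ == "__main__":
    data = [
        Trajectory("buy a red mug", ["search mug", "order confirmed #123"], True),
        Trajectory("post on a forum", ["open forum", "submit post"], True),
    ]
    print(f"rule-based judge accuracy: {judge_accuracy(dummy_rule_judge, data):.2f}")
```

In this toy setup the rule-based judge misses the successful forum post entirely, mirroring on a small scale how rigid rules can underreport agent success.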
Why it matters?
This work matters because it improves how we test, and therefore how much we can trust, AI agents that operate on the web. By identifying more reliable ways to automatically evaluate these agents, AgentRewardBench can lead to smarter, more dependable AI systems that people can count on for real-world online tasks.
Abstract
AgentRewardBench is a benchmark for evaluating how effectively LLMs assess web agent trajectories; its results show that rule-based evaluation methods tend to underreport agent success.