AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen, Fuli Feng, Xueyuan Hao, Xi Su, Qi Gu, Hui Su, Xunliang Cai, Xiangnan He
2026-04-22
Summary
This paper investigates how to reliably check whether AI agents, specifically those powered by large language models, behave correctly when they're given complex tasks in different digital environments.
What's the problem?
As AI agents become more powerful and are used in more complicated situations, it's getting harder to be sure they're actually doing what they're supposed to. Current methods for checking their work either rely on rigid, hand-written rules that don't generalize to new situations, or use another large language model as a judge, which isn't always accurate or reliable because it judges from text alone, without gathering concrete evidence from the environment.
What's the solution?
The researchers developed a new approach called 'Agent-as-a-Judge', where a judge agent actively *tests* the work being evaluated by interacting with the environment and using tools to gather proof of what actually happened. To evaluate this approach, they created a benchmark called AJ-Bench, which includes 155 tasks and 516 annotated trajectories across areas like web searching, working with databases, and using graphical interfaces. The benchmark specifically tests how well the judge agent can gather information, confirm the current state of things, and verify the steps taken to complete a task.
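To make the idea concrete, here is a minimal, hypothetical sketch of the Agent-as-a-Judge loop: rather than scoring a trajectory from text alone, the judge queries the environment with tools to collect verifiable evidence (state verification) and inspects the recorded steps (process verification). All class and function names below are illustrative assumptions, not the paper's actual API.

```python
# Illustrative sketch only; names and structure are assumptions, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class Environment:
    """Toy environment: a key-value store the task agent was asked to modify."""
    state: dict = field(default_factory=dict)

    def read(self, key):
        # Tool call: the judge acquires evidence directly from the environment.
        return self.state.get(key)

@dataclass
class Verdict:
    passed: bool
    evidence: list

def judge(env: Environment, expected_state: dict, trajectory: list) -> Verdict:
    evidence = []
    # State verification: query the environment for each expected fact.
    for key, want in expected_state.items():
        got = env.read(key)
        evidence.append((key, got, want))
        if got != want:
            return Verdict(False, evidence)
    # Process verification: check that the trajectory contains the required step.
    if not any(step.startswith("write") for step in trajectory):
        return Verdict(False, evidence)
    return Verdict(True, evidence)

# Usage: a task agent was asked to set "status" to "done".
env = Environment(state={"status": "done"})
result = judge(env, {"status": "done"}, ["search docs", "write status=done"])
print(result.passed)  # True
```

The key design point is that the verdict is grounded in evidence the judge gathered itself, so a plausible-sounding but incorrect trajectory fails when the environment's actual state contradicts it.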
Why it matters?
This work is important because it provides a more robust way to verify the behavior of increasingly complex AI agents. If we can't trust that these agents are working correctly, it limits their usefulness in real-world applications. The AJ-Bench benchmark also provides a valuable tool for researchers to develop and improve these verification methods, ultimately leading to more reliable and trustworthy AI systems.
Abstract
As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce a benchmark, AJ-Bench, to systematically evaluate Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.