Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

2024-10-16

Summary

This paper presents the Agent-as-a-Judge framework, a new method for evaluating AI systems (agents) by using other AI agents to assess their performance, particularly in tasks like code generation.

What's the problem?

Current methods for evaluating AI agents fall short in two ways: they typically judge only the final result of a task, ignoring the step-by-step process the agent followed to get there, or they rely on extensive manual review, which is time-consuming and expensive.

What's the solution?

The authors introduce the Agent-as-a-Judge framework, in which one agentic system evaluates another. Because the judge is itself an agent, it can give detailed feedback on intermediate steps throughout the task-solving process rather than scoring only the end result. To test the framework, they built DevAI, a benchmark of 55 realistic automated AI development tasks annotated with hierarchical user requirements. Evaluating three popular agentic code-generation systems, they found that Agent-as-a-Judge dramatically outperforms the earlier LLM-as-a-Judge approach and matches the reliability of human evaluation.
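
To make the idea concrete, here is a minimal Python sketch of what an Agent-as-a-Judge loop could look like. This is not the authors' implementation: the Requirement, Verdict, judge_task, and ask_llm names, the prompt format, and the truncation limits are all illustrative assumptions. What it captures is the core idea that the judge inspects the developer agent's workspace and trajectory and returns a verdict per user requirement, rather than a single pass/fail on the final output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch (not the paper's code): a "judge" agent checks each user
# requirement against the developer agent's workspace and step-by-step trajectory,
# instead of scoring only the final deliverable.
# `ask_llm` is any callable that sends a prompt to a language model and returns
# its text reply (e.g. a thin wrapper around whatever API you use).

@dataclass
class Requirement:
    rid: str          # e.g. "R1.2" in a hierarchical requirement tree
    description: str  # natural-language requirement from the task spec

@dataclass
class Verdict:
    rid: str
    satisfied: bool
    evidence: str     # the judge's justification, usable as intermediate feedback

def judge_task(
    requirements: List[Requirement],
    workspace_files: Dict[str, str],   # path -> file contents produced by the dev agent
    trajectory: List[str],             # log of the dev agent's actions, in order
    ask_llm: Callable[[str], str],
) -> List[Verdict]:
    """Return one verdict per requirement, judging the whole task-solving process."""
    verdicts = []
    for req in requirements:
        # Give the judge the requirement plus evidence: project files and recent steps.
        prompt = (
            f"Requirement {req.rid}: {req.description}\n\n"
            "Project files:\n"
            + "\n".join(f"--- {path} ---\n{text[:2000]}"
                        for path, text in workspace_files.items())
            + "\n\nAgent trajectory (last steps):\n"
            + "\n".join(trajectory[-20:])
            + "\n\nIs this requirement satisfied? Answer YES or NO on the first line, "
              "then justify briefly."
        )
        reply = ask_llm(prompt)
        lines = reply.strip().splitlines()
        satisfied = bool(lines) and lines[0].upper().startswith("YES")
        verdicts.append(Verdict(req.rid, satisfied, reply))
    return verdicts
```

In the full framework the judge is itself an agentic system (e.g. able to locate and read the relevant project files on its own), which is what distinguishes Agent-as-a-Judge from a single LLM-as-a-Judge prompt; the sketch above compresses that into one model call per requirement.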

Why it matters?

This research is significant because it offers a more efficient and reliable way to evaluate AI systems. By using agents to assess other agents, the Agent-as-a-Judge framework saves time and cost while producing rich feedback that can guide improvement. This advancement could lead to better and more capable AI systems across many fields.

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems -- by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.