
Agent-as-a-Judge

Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li

2026-01-09


Summary

This paper looks at how the field is moving beyond simply using large language models (LLMs) to judge the quality of other AI systems and toward more sophisticated 'agent' systems that can evaluate AI more thoroughly and reliably.

What's the problem?

Initially, using LLMs to evaluate other AI was a great step forward because it allowed for quick and large-scale assessments. However, LLMs have limitations; they can be biased, don't always think things through deeply, and can't easily check their judgments against the real world. As AI systems become more complex and require multiple steps to complete a task, these weaknesses become more apparent, making the evaluations less trustworthy.
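For context, 'LLM-as-a-Judge' usually means a single prompt that asks a model to score an output in one pass, with no tools, follow-up, or memory, which is where the weaknesses above come from. Here is a rough illustrative sketch of that pattern; the `call_llm` stub is a hypothetical placeholder for a real model API, not anything from the paper.

```python
# Single-pass LLM-as-a-Judge: one prompt in, one score out.
# No planning, no tool checks, no memory of earlier judgments.
def call_llm(prompt: str) -> str:
    return "4"  # hypothetical stub standing in for a real model API call

def llm_judge(task: str, submission: str) -> int:
    prompt = (
        "Rate the following submission for the task on a 1-5 scale.\n"
        f"Task: {task}\nSubmission: {submission}\nScore:"
    )
    return int(call_llm(prompt))  # whatever the model says in one pass is final
```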

What's the solution?

The paper surveys the emerging 'Agent-as-a-Judge' paradigm, where instead of a single LLM, AI 'agents' plan out how to evaluate something, use tools to verify information, collaborate with other agents, and remember past evaluations. The authors provide a comprehensive overview of this approach, categorizing the methods in use and how they are applied across different fields. They essentially create a map of this rapidly evolving area.
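To make the contrast with the single-pass judge concrete, below is a minimal Python sketch of the agentic pattern the paper describes: the judge plans evaluation steps, verifies each one with an external tool, and keeps its verdicts in a persistent memory (multi-agent collaboration is omitted for brevity). The names `AgentJudge`, `call_llm`, and `run_check_tool` are hypothetical placeholders, not the paper's implementation or any specific library's API.

```python
# A minimal sketch of an agentic judge: plan evaluation steps, check each step
# with an external tool, and store verdicts in a persistent memory.
# `call_llm` and `run_check_tool` are hypothetical stubs, not a real API.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call that returns an evaluation plan.
    return "check that the code runs\ncheck that the output matches the spec"

def run_check_tool(step: str, submission: str) -> bool:
    # Stand-in for tool-augmented verification (e.g. executing code or
    # querying a search engine) instead of trusting a one-shot opinion.
    return "TODO" not in submission

@dataclass
class AgentJudge:
    memory: list = field(default_factory=list)  # persistent record of past verdicts

    def judge(self, task: str, submission: str) -> dict:
        # 1. Plan: ask the LLM to break the evaluation into checkable steps.
        plan = call_llm(f"List evaluation steps.\nTask: {task}\nSubmission: {submission}")
        steps = [s for s in plan.splitlines() if s.strip()]
        # 2. Verify: ground each step with a tool rather than a single-pass guess.
        evidence = {step: run_check_tool(step, submission) for step in steps}
        # 3. Decide and remember: the verdict carries its evidence and is kept
        #    so later evaluations can reuse it.
        verdict = {"task": task, "passed": all(evidence.values()), "evidence": evidence}
        self.memory.append(verdict)
        return verdict

if __name__ == "__main__":
    judge = AgentJudge()
    print(judge.judge("Write a sorting function", "def sort(xs): return sorted(xs)"))
```

The point of the sketch is the shape of the loop, not the specifics: judgments are tied to verifiable evidence and accumulated over time rather than produced in a single forward pass.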

Why it matters?

This work is important because it provides a much-needed structure for understanding and developing better AI evaluation methods. As AI becomes more powerful and integrated into our lives, it's crucial to have reliable ways to assess its performance and ensure it's safe and effective. This paper helps guide future research and development in this critical area, paving the way for more robust and trustworthy AI systems.

Abstract

LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.