Are We on the Right Way to Assessing LLM-as-a-Judge?
Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen
2025-12-22
Summary
This paper investigates how reliable large language models (LLMs) are when used to evaluate other LLMs, a process called 'LLM-as-a-Judge'. It points out flaws in current methods for testing these judging LLMs and proposes a new way to assess their performance without relying on human-annotated answers.
What's the problem?
Currently, evaluating how well an LLM judges other LLMs relies heavily on humans providing 'correct' reference answers. This is problematic because human judgments can be biased, and collecting enough of them takes considerable time and effort, so the approach doesn't scale. The paper argues that if the evaluation relies on potentially flawed human assessments, we can't truly know how trustworthy the LLM judge actually is.
What's the solution?
The researchers created a new evaluation suite called Sage. Instead of asking humans what the 'right' answer is, Sage checks whether the LLM judge's own reasoning is internally consistent. It looks at two things: does the judge stick with the same preference when the same pair of answers is compared again (for example, with the answers presented in a different order), and do its preferences stay logically consistent, i.e. transitive, across a whole set of answers? This is based on the idea that a rational decision-maker should have stable and logical preferences. They tested Sage on a variety of questions and found it aligns well with existing human-based evaluations, which supports it as a reliable alternative.
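To make the two consistency checks concrete, here is a minimal Python sketch of what they could look like. This is an illustration under assumptions, not the paper's implementation: the `judge` callable, which takes a question plus two candidate answers and returns "A" or "B", is a hypothetical stand-in for an LLM judge, and Sage's actual metrics and prompting setup may differ.

```python
from itertools import combinations

def pairwise_stable(judge, question, answer_a, answer_b):
    """Local self-consistency: the judge should prefer the same answer
    regardless of the order in which the two candidates are shown."""
    first = judge(question, answer_a, answer_b)   # expected to return "A" or "B"
    second = judge(question, answer_b, answer_a)  # same pair, order swapped
    # The same underlying answer must win in both orderings.
    return (first == "A") == (second == "B")

def transitive(judge, question, answers):
    """Global logical consistency: pairwise preferences over the full set of
    answers should contain no cycles (A > B, B > C, but C > A)."""
    # Query the judge once per unordered pair (i < j).
    beats = {
        (i, j): judge(question, answers[i], answers[j]) == "A"
        for i, j in combinations(range(len(answers)), 2)
    }

    def prefers(i, j):
        # True if the judge prefers answers[i] over answers[j].
        return beats[(i, j)] if i < j else not beats[(j, i)]

    # A complete set of pairwise preferences is transitive
    # exactly when no three answers form a preference cycle.
    for i, j, k in combinations(range(len(answers)), 3):
        if prefers(i, j) and prefers(j, k) and prefers(k, i):
            return False
        if prefers(j, i) and prefers(k, j) and prefers(i, k):
            return False
    return True
```

The transitivity check relies on the fact that a complete set of pairwise preferences is transitive if and only if no triple of answers forms a cycle, which is why it only needs to examine three answers at a time.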
Why it matters?
This work is important because it reveals that even the most advanced LLMs, including Gemini-2.5-Pro and GPT-5, aren't always consistent when judging other models. The authors identify a phenomenon called 'situational preference,' where the judge's choice shifts depending on how the comparison is framed, and note that explicit rubrics or criteria can help it judge more consistently. The research also shows that fine-tuning LLMs specifically as judges, using a panel of multiple judges, and encouraging deeper reasoning can all improve consistency. Finally, it questions whether human judgments are truly the 'gold standard' for evaluation, since humans themselves can be inconsistent.
Abstract
LLM-as-a-Judge has been widely adopted as an evaluation method and has served as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge mainly rely on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that a fine-tuned LLM-as-a-Judge is a feasible way to boost performance, and that panel-based judging as well as deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.