Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi
2025-08-26

Summary
This paper questions whether we're getting too excited about using AI, specifically large language models, to automatically judge the quality of other AI-generated text. It argues that while using AI as a judge seems promising, we haven't thoroughly checked if it's actually a reliable way to evaluate how good these systems are.
What's the problem?
Evaluating how well AI writes is really hard. Traditionally, humans do it, but that's slow and expensive. New AI models are now capable enough that they *could* act as judges themselves, but the paper points out that we've started relying on them without fully checking whether their judgments line up with what humans would think. The core issue is a lack of careful testing to ensure these AI judges are actually trustworthy and consistent.
What's the solution?
The authors don't propose a new method; instead, they analyze the *assumptions* we make when we use AI as a judge. They look at four key ideas: that AI judges can accurately stand in for human judgment, that they are genuinely capable evaluators, that they scale to large amounts of text, and that they are cost-effective. They then examine how each of these assumptions might be flawed, using three applications as examples: summarizing text, labeling data, and keeping AI systems safe. To highlight potential problems, they draw on measurement theory from the social sciences, which asks whether a measurement is reliable (consistent) and valid (actually measuring what it claims to measure).
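To make that measurement framing concrete, here is a minimal sketch, not taken from the paper, of how those two checks are often operationalized when auditing an LLM judge. The `llm_judge` function is a hypothetical placeholder for a real scoring call to a model, the human ratings are invented reference values, and the rank correlation relies on `scipy`; all of these are illustrative assumptions rather than the authors' setup.

```python
import random
from statistics import mean
from scipy.stats import spearmanr

def llm_judge(text: str) -> float:
    """Hypothetical placeholder for an LLM scoring call (1-5 quality scale)."""
    base = 1 + 4 * (hash(text) % 100) / 100      # stand-in for a model's preference
    return min(5.0, max(1.0, base + random.gauss(0, 0.3)))  # noisy, as real judges are

summaries = ["summary A ...", "summary B ...", "summary C ...", "summary D ..."]
human_scores = [4.0, 2.5, 3.0, 4.5]              # invented human reference ratings

# Reliability: does the judge score the *same* input consistently across repeats?
repeats = [[llm_judge(s) for _ in range(5)] for s in summaries]
print("mean score spread across repeated judgments:",
      round(mean(max(r) - min(r) for r in repeats), 2))

# Validity (the proxy-for-human-judgment assumption): do the judge's scores
# rank the outputs the same way the human annotators do?
judge_scores = [mean(r) for r in repeats]
rho, _ = spearmanr(judge_scores, human_scores)
print("Spearman correlation with human ratings:", round(rho, 2))
```

In the paper's terms, the first check probes reliability and the second probes the assumption that the judge is a proxy for human judgment; weak results on either would undermine trust in the judge's scores.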
Why it matters?
This paper is important because if we rely on faulty AI judges, we might think AI writing is better than it actually is, or we might miss important flaws. This could slow down real progress in improving AI writing. The authors are calling for more careful and responsible testing of these AI judges to make sure they're actually helping us build better and more reliable language models.
Abstract
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, of LLJs, or of current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in the evaluation of LLJs, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.