JudgeBench: A Benchmark for Evaluating LLM-based Judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica

2024-10-18

Summary

This paper introduces JudgeBench, a new benchmark designed to evaluate how well large language model (LLM)-based judges can assess the correctness of responses on challenging knowledge, reasoning, math, and coding tasks.

What's the problem?

As LLMs become more advanced, they are increasingly used to evaluate other AI models. However, the reliability of these LLM-based judges is often overlooked. Current evaluation methods mainly focus on how well these judges align with human preferences, which can be misleading, especially for complex tasks that require factual and logical accuracy. This creates a need for a better way to assess the capabilities of LLM-based judges.

What's the solution?

To address this issue, the authors developed JudgeBench, which provides a standardized way to evaluate LLM-based judges on difficult response pairs covering knowledge, reasoning, math, and coding tasks. JudgeBench uses a novel pipeline to convert existing datasets into challenging pairs of responses with labels that reflect objective correctness. The authors evaluated a range of judges, including prompted judges, fine-tuned judges, multi-agent judges, and reward models, and found that even strong models such as GPT-4o performed only slightly better than random guessing, highlighting the benchmark's difficulty and effectiveness.
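To make this evaluation setup concrete, here is a minimal sketch of scoring a judge on objectively labeled response pairs. The ResponsePair structure, the prompt wording, and the ask_llm callable are illustrative assumptions, not JudgeBench's actual interface; the key idea is that accuracy is measured against ground-truth correctness labels, so a judge guessing at random lands near 50%.

```python
# Minimal sketch of judging labeled response pairs (illustrative, not JudgeBench code).
import random
from dataclasses import dataclass


@dataclass
class ResponsePair:
    question: str
    response_a: str
    response_b: str
    correct: str  # "A" or "B", determined by objective verification of the answer


def judge_pair(pair: ResponsePair, ask_llm) -> bool:
    """Ask the judge which response is correct and compare against the label."""
    prompt = (
        f"Question:\n{pair.question}\n\n"
        f"Response A:\n{pair.response_a}\n\n"
        f"Response B:\n{pair.response_b}\n\n"
        "Which response answers the question correctly? Reply with 'A' or 'B'."
    )
    verdict = ask_llm(prompt).strip().upper()[:1]
    return verdict == pair.correct


def judge_accuracy(pairs, ask_llm) -> float:
    """Fraction of pairs where the judge picks the objectively correct response.
    A judge guessing at random scores about 0.5."""
    return sum(judge_pair(p, ask_llm) for p in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy usage with a stub "judge" that guesses randomly.
    pairs = [
        ResponsePair("What is 17 * 23?", "17 * 23 = 391", "17 * 23 = 381", "A"),
        ResponsePair("Is 91 prime?", "Yes, 91 is prime.", "No, 91 = 7 * 13.", "B"),
    ]
    random_judge = lambda prompt: random.choice(["A", "B"])
    print(f"accuracy: {judge_accuracy(pairs, random_judge):.2f}")
```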

Why it matters?

This research is significant because it offers a reliable platform for assessing the performance of LLM-based judges, which is crucial as AI systems become more complex. By focusing on factual and logical correctness rather than just stylistic preferences, JudgeBench helps ensure that AI evaluations are accurate and trustworthy. This advancement can lead to improved AI technologies that better serve users in various applications.

Abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.
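The "novel pipeline" mentioned in the abstract converts existing difficult datasets with verifiable answers into labeled response pairs. Below is a rough sketch of that general idea under stated assumptions: sample_responses and is_correct are hypothetical helpers, and the pairing and position-shuffling logic is illustrative rather than the authors' exact procedure.

```python
# Rough sketch of building labeled response pairs from a dataset with ground-truth
# answers, in the spirit of the pipeline described in the abstract (illustrative only).
import random
from typing import Callable


def build_response_pairs(
    dataset: list[dict],                                  # each item: {"question": ..., "answer": ...}
    sample_responses: Callable[[str, int], list[str]],    # draw n model responses for a question
    is_correct: Callable[[str, str], bool],               # verify a response against the ground truth
    n_samples: int = 8,
) -> list[dict]:
    pairs = []
    for item in dataset:
        responses = sample_responses(item["question"], n_samples)
        good = [r for r in responses if is_correct(r, item["answer"])]
        bad = [r for r in responses if not is_correct(r, item["answer"])]
        # Only questions where the model produces both correct and incorrect
        # responses yield a challenging pair; pick one of each at random.
        if good and bad:
            a, b = random.choice(good), random.choice(bad)
            if random.random() < 0.5:  # shuffle positions to avoid ordering bias
                pairs.append({"question": item["question"],
                              "response_a": a, "response_b": b, "correct": "A"})
            else:
                pairs.append({"question": item["question"],
                              "response_a": b, "response_b": a, "correct": "B"})
    return pairs
```

Because the labels come from objective verification rather than crowdsourced preference, a judge can only score well by identifying which response is actually correct, which is what makes the benchmark hard for models that rely on stylistic cues.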