JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
2024-12-13

Summary
This paper introduces JuStRank, a new benchmark designed to evaluate how well large language models (LLMs) can rank different AI systems based on their outputs.
What's the problem?
As generative AI technology rapidly advances, there are many different models and configurations to choose from, yet systematically comparing them to determine which performs best remains difficult. Previous methods for evaluating LLM judges worked at the instance level, scoring individual responses or response pairs without regard to which system produced them, and so overlooked factors like a judge's bias toward certain systems, which can distort system-level rankings.
What's the solution?
JuStRank addresses this gap with the first large-scale study of LLM judges as system rankers. For each judge, a system-level score is produced by aggregating the judge's scores over many outputs from that system, and the resulting system ranking is compared against a human-based ranking (see the sketch below). Beyond measuring ranking accuracy, this setup enables a fine-grained analysis of judge behavior, including any biases a judge may hold toward particular systems.
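The following is a minimal sketch of this aggregate-then-rank idea, not the authors' implementation: it averages hypothetical per-response judge scores into system scores (mean aggregation is just one possible choice) and compares the induced ranking to a human-based ranking with Kendall's tau, a standard rank-correlation measure. All data values and system names are made up for illustration.

```python
# Sketch: aggregate per-response judge scores into system-level scores,
# then compare the induced system ranking to a human-based ranking.
from collections import defaultdict
from scipy.stats import kendalltau

# Hypothetical judge scores: (system_name, judge_score_for_one_response)
judge_scores = [
    ("system_a", 0.9), ("system_a", 0.7), ("system_a", 0.8),
    ("system_b", 0.6), ("system_b", 0.5), ("system_b", 0.7),
    ("system_c", 0.4), ("system_c", 0.3), ("system_c", 0.5),
]

# Aggregate: mean judge score per system (one possible aggregation function).
per_system = defaultdict(list)
for system, score in judge_scores:
    per_system[system].append(score)
system_scores = {s: sum(v) / len(v) for s, v in per_system.items()}

# Hypothetical human-derived system scores (e.g., from a human-voted leaderboard).
human_scores = {"system_a": 0.85, "system_b": 0.55, "system_c": 0.60}

# Judge quality = rank correlation between judge-based and human-based rankings.
systems = sorted(system_scores)
tau, _ = kendalltau(
    [system_scores[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"Kendall's tau between judge and human rankings: {tau:.2f}")
```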
Why it matters?
This research is significant because it helps standardize the evaluation of AI systems, ensuring that the judges used in these assessments are reliable and fair. By understanding how LLM judges perform and where they might struggle, researchers can improve these models and make better decisions about which AI systems to use in various applications.
Abstract
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
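As a rough illustration of the kind of judge characterization the abstract mentions, the sketch below approximates two of the behavioral traits it names; these are not the paper's exact definitions. Bias is taken here as the gap between a judge's system score and the human-based score for the same system, and decisiveness is proxied by the spread of the judge's per-response scores. All values are illustrative.

```python
# Sketch of per-system bias and a decisiveness proxy for a single judge.
import statistics

# Hypothetical judge-derived and human-derived system scores (same scale).
judge_system_scores = {"system_a": 0.80, "system_b": 0.60, "system_c": 0.40}
human_system_scores = {"system_a": 0.85, "system_b": 0.55, "system_c": 0.60}

# Per-system bias: positive means the judge favors the system relative to humans.
bias = {
    s: judge_system_scores[s] - human_system_scores[s]
    for s in judge_system_scores
}

# Decisiveness proxy: spread of the judge's per-response scores
# (a judge that hedges toward the middle of the scale shows low spread).
per_response_scores = [0.9, 0.7, 0.8, 0.6, 0.5, 0.7, 0.4, 0.3, 0.5]
decisiveness = statistics.pstdev(per_response_scores)

for system, b in bias.items():
    print(f"{system}: bias {b:+.2f}")
print(f"decisiveness proxy (score spread): {decisiveness:.2f}")
```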