
Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge

Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma

2025-02-19


Summary

This paper introduces a new way to make AI language models better at judging other AI outputs, called Crowd-based Comparative Evaluation. It's like teaching an AI to be a more thorough judge by comparing each answer against a group of other responses, instead of just looking at one answer on its own.

What's the problem?

The current way of using AI to judge other AI outputs, called LLM-as-a-Judge, doesn't always catch all the important details. It's like having a teacher who grades essays but sometimes misses important points because they're not looking closely enough. This can lead to incomplete or inaccurate judgments.

What's the solution?

The researchers created a method called Crowd-based Comparative Evaluation. This method shows the AI judge not just the answer it's supposed to grade, but also a set of other answers to the same question. By comparing the candidate against these "crowd" responses, the AI can spot more details and give a more thorough judgment. It's like handing a teacher a stack of other students' essays to compare against, helping them notice more details in each one they grade.
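To make this concrete, the comparison step can be sketched as a prompt-assembly helper: the judge is shown the crowd responses alongside the two candidates it must compare. The function name and prompt wording below are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of Crowd-based Comparative Evaluation prompting.
# The helper name and wording are illustrative, not taken from the paper.

def build_judge_prompt(question, candidate_a, candidate_b, crowd_responses):
    """Assemble a judge prompt that adds crowd responses for comparison."""
    crowd_block = "\n".join(
        f"Crowd response {i + 1}: {r}" for i, r in enumerate(crowd_responses)
    )
    return (
        f"Question: {question}\n\n"
        f"Crowd responses (for comparison only, not to be graded):\n"
        f"{crowd_block}\n\n"
        f"Candidate A: {candidate_a}\n"
        f"Candidate B: {candidate_b}\n\n"
        "Compare each candidate against the crowd responses, note details "
        "the candidates cover or miss, then give a chain-of-thought "
        "judgment and pick the better candidate."
    )

prompt = build_judge_prompt(
    "What causes tides?",
    "The Moon's gravity pulls on Earth's oceans.",
    "Tides are caused by wind.",
    [
        "Tides result from the gravitational pull of the Moon and Sun.",
        "Ocean bulges form on the side facing the Moon and the far side.",
    ],
)
print(prompt)
```

The extra crowd responses give the judge reference points, so its chain-of-thought can mention details (like the Sun's role above) that a single-answer comparison would miss.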

Why it matters?

This matters because it makes AI evaluation more reliable and accurate, and better evaluation means we can improve AI systems more effectively. The new method led to an average accuracy gain of 6.7% across five benchmarks. It also helps train new AI models more efficiently. With AI judges that give more detailed and accurate feedback, we can build smarter and more reliable AI systems for many applications, from chatbots to complex problem-solving tools.

Abstract

LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by the CoT reasoning's inability to capture comprehensive and deeper details, often leading to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which is insufficient to address the limitation in CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs generated by ours are more comprehensive and of higher quality, and evaluation accuracy improves as inference scales.
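The abstract's "crowd rejection sampling" idea, using judge scores to keep only the best sampled responses for SFT data, can be sketched as follows. This is a minimal illustration: `judge_score` is a stand-in (here a toy word-count proxy), whereas the real system would query an LLM judge with crowd-augmented prompts.

```python
# Hypothetical sketch of rejection sampling for SFT data selection.
# judge_score is a toy stand-in for an LLM judge's quality score.

def judge_score(question, response):
    # Placeholder heuristic: a real system would ask an LLM judge
    # (with crowd responses in context) to score the response.
    return len(response.split())

def reject_sample(question, candidates, keep=1):
    """Keep the top-scoring candidate responses as SFT training data."""
    ranked = sorted(
        candidates,
        key=lambda r: judge_score(question, r),
        reverse=True,
    )
    return ranked[:keep]

best = reject_sample(
    "Explain photosynthesis.",
    [
        "Plants make food.",
        "Plants convert sunlight, water, and CO2 into glucose and oxygen.",
    ],
)
print(best)
```

The paper's claim is that crowd-informed judges score responses more comprehensively, so the responses kept by this filter make for better SFT data.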