CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen

2024-10-22

Summary

This paper introduces CompassJudger-1, an all-in-one judge model designed to evaluate and improve large language models (LLMs) efficiently and accurately.

What's the problem?

Evaluating language models is crucial for improving them, but traditional approaches rely heavily on human judges, which are expensive and inconsistent. Automated evaluators are needed to provide reliable assessments that reflect real-world usage, yet existing automated judges are often not accurate or versatile enough to cover the full range of evaluation tasks.

What's the solution?

To address this problem, the authors developed CompassJudger-1, an open-source model that can perform various evaluation tasks. It can score models, compare two models, generate critiques, and handle different evaluation formats. Additionally, they created JudgerBench, a benchmark that allows for standardized testing of different judge models across a variety of subjective evaluation tasks. This combination makes it easier to assess how well language models perform and where they need improvement.
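
To make the judge workflow concrete, here is a minimal sketch of querying such a judge model for a pairwise comparison with Hugging Face transformers. The checkpoint name, prompt template, and decoding settings are illustrative assumptions rather than the official CompassJudger-1 interface; the released models and usage instructions are in the linked repository.

```python
# Minimal sketch of using an LLM-as-judge for pairwise comparison.
# The model id and prompt wording below are assumptions for illustration,
# not the official CompassJudger-1 API.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "opencompass/CompassJudger-1-7B-Instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "Explain why the sky is blue."
answer_a = "Because of Rayleigh scattering of sunlight by air molecules."
answer_b = "Because the ocean reflects its color into the atmosphere."

# A generic pairwise-judging prompt; the real template may differ.
prompt = (
    "You are an impartial judge. Compare the two answers to the question and "
    "reply with 'A' or 'B' for the better answer, followed by a brief critique.\n"
    f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\nVerdict:"
)

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the verdict deterministic for a given prompt.
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The same model could, in principle, be prompted for single-response scoring or free-form critique generation by changing only the prompt, which is the versatility the paper emphasizes.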

Why it matters?

This research is significant because it provides a comprehensive tool for evaluating language models, which can help researchers and developers enhance their AI systems. By making CompassJudger-1 and JudgerBench available to the public, the authors aim to foster collaboration in the AI community and accelerate advancements in evaluation methodologies, ultimately leading to better-performing language models.

Abstract

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.