CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen
2025-07-15
Summary
This paper introduces CompassJudger-2, a model designed to serve as a judge that can evaluate large language models across many different tasks and domains.
What's the problem?
Existing judge models are often too specialized, or simply not capable enough, to judge well across many different areas. This limits their usefulness for evaluating AI performance fairly and accurately.
What's the solution?
To address this, CompassJudger-2 gathers and synthesizes a wide range of training data tailored to different judging tasks. It then trains the model with a learning objective based on verifiable rewards, meaning the model's judgments are checked against known correct answers, combined with mechanisms that strengthen its general critical-thinking ability. The result is a judge that is more accurate and robust across domains.
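The core of a "verifiable reward" can be illustrated with a minimal sketch: the judge produces a free-form critique ending in a verdict, and the reward is simply whether that verdict matches a verified ground-truth label. The function names and verdict format below are hypothetical illustrations, not the paper's actual implementation.

```python
def extract_verdict(judge_output: str) -> str:
    """Pull a final verdict ('A' or 'B') from the judge's free-form critique.
    Assumes (hypothetically) the critique ends with a line like 'Verdict: A'."""
    for line in reversed(judge_output.strip().splitlines()):
        if line.lower().startswith("verdict:"):
            return line.split(":", 1)[1].strip().upper()
    return ""

def verifiable_reward(judge_output: str, gold_label: str) -> float:
    """Binary reward: 1.0 if the judge's verdict matches the verified label."""
    return 1.0 if extract_verdict(judge_output) == gold_label.upper() else 0.0

# Example: a critique whose verdict can be checked against a known answer.
critique = "Response A is factually grounded; Response B hallucinates.\nVerdict: A"
print(verifiable_reward(critique, "A"))  # 1.0
print(verifiable_reward(critique, "B"))  # 0.0
```

Because the reward is computed from a verifiable label rather than another model's opinion, it gives the training process an objective signal to optimize against.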
Why it matters?
This work matters because it provides a more reliable and flexible AI judge that can fairly evaluate other models in many contexts. It sets a new standard for judge models, delivering strong and consistent evaluation performance even when compared to larger, more complex models.
Abstract
CompassJudger-2, a generalist judge model, improves cross-domain evaluation accuracy and robustness through task-driven data curation and a refined learning objective.