JudgeLRM: Large Reasoning Models as a Judge

Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He

2025-04-02

Summary

This paper explores how to train AI models to better judge the quality of other AI models' outputs, especially on tasks that require complex reasoning.

What's the problem?

Current AI judges are usually trained with supervised fine-tuning (SFT), which falls short on evaluation tasks that demand complex reasoning: the authors find that the more reasoning a task requires, the less SFT improves the judge.

What's the solution?

The researchers developed JudgeLRM, a family of judgment-oriented models trained with reinforcement learning using judge-wise, outcome-driven rewards. Rewarding the judge for reaching the correct final verdict, rather than imitating reference judgments, makes it better at evaluating reasoning-heavy tasks.
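To make "outcome-driven rewards" concrete, here is a minimal illustrative sketch, not the paper's exact reward function: the judge scores two candidate answers, and the reward depends only on whether its implied preference matches the human-labeled preference. The function name and signature are assumptions for illustration.

```python
def judge_reward(score_a: float, score_b: float, human_preferred: str) -> float:
    """Toy outcome-driven reward for a pairwise judge (illustrative, not the
    paper's exact formulation).

    The judge model emits numeric quality scores for answers A and B; the
    reward is 1.0 if the judge's implied preference agrees with the
    human-labeled preference ("A" or "B"), else 0.0. Only the final outcome
    is rewarded, not the reasoning trace that produced the scores.
    """
    predicted = "A" if score_a > score_b else "B"
    return 1.0 if predicted == human_preferred else 0.0


# Example: the judge scores answer A higher, and humans also preferred A.
reward = judge_reward(8.0, 5.0, human_preferred="A")  # -> 1.0
```

In an RL setup such as PPO, a scalar reward like this would be fed back to update the judge's policy, pushing it toward judgments whose final verdicts are correct.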

Why it matters?

This work matters because it can lead to more reliable and accurate automated evaluation of AI models, which is essential for benchmarking and improving AI systems at scale.

Abstract

The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.