Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

2025-10-29

Summary

This paper introduces Critique-RL, a method for teaching language models not only to *generate* answers but also to *critique* them – essentially, to give feedback the way a teacher would. The goal is to improve these models' reasoning abilities.

What's the problem?

Currently, training a model to give good critiques requires a stronger 'teacher' model to annotate the initial feedback data, and such high-quality teacher models are expensive or simply unavailable. The paper also shows that letting a model learn to critique based only on whether its suggestions improve the final answer isn't enough: the model gets good at *sounding* helpful, but doesn't actually get better at telling good answers from bad ones.

What's the solution?

Critique-RL uses a two-stage process. In stage I, it directly trains the 'critic' model to distinguish good answers from bad ones using clear, rule-based rewards. In stage II, the critic learns from how much its feedback actually helps the 'actor' model improve its answers, while regularization ensures the critic doesn't lose its ability to judge answer quality. Balancing these direct and indirect rewards throughout training is the key to the method.
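The two-stage reward design described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, binary reward values, and the weighting term `alpha` are placeholders, not the paper's actual formulation.

```python
# Hypothetical sketch of Critique-RL's two-stage rewards.
# All names and values here are illustrative assumptions.

def stage1_reward(critic_verdict: bool, response_is_correct: bool) -> float:
    """Stage I: direct rule-based reward for discriminability.
    The critic is rewarded when its verdict (good/bad) matches
    whether the actor's response is actually correct."""
    return 1.0 if critic_verdict == response_is_correct else 0.0

def stage2_reward(critic_verdict: bool, response_is_correct: bool,
                  refined_is_correct: bool, alpha: float = 0.5) -> float:
    """Stage II: indirect reward from actor refinement (helpfulness),
    regularized by the Stage-I discriminability term so the critic
    keeps its ability to judge responses while learning to help."""
    helpfulness = 1.0 if refined_is_correct else 0.0
    discriminability = stage1_reward(critic_verdict, response_is_correct)
    return helpfulness + alpha * discriminability
```

The point of the `alpha`-weighted term is the "careful balancing act": without it, optimizing only the helpfulness signal lets discriminability drift, which is exactly the failure mode the paper reports.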

Why it matters?

This research matters because it lets us train critic models without needing a stronger, more expensive teacher model for supervision. That makes it more practical and affordable to improve the reasoning skills of language models, leading to better performance on complex tasks and better generalization to new ones.

Abstract

Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
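The two-player paradigm in the abstract (the actor generates, the critic gives feedback, the actor refines) can be sketched as a simple loop. The classes and method names below are toy stand-ins for the two language models, not the paper's implementation:

```python
# Toy sketch of the actor-critic refinement loop from the abstract.
# ToyActor and ToyCritic are placeholders for language models.

class ToyActor:
    def generate(self, question: str) -> str:
        return f"draft answer to: {question}"

    def refine(self, question: str, response: str, feedback: str) -> str:
        return f"{response} (revised per feedback: {feedback})"

class ToyCritic:
    def critique(self, question: str, response: str) -> str:
        return "re-check the final step"

def refinement_round(actor, critic, question: str) -> str:
    """One round of the two-player paradigm: draft, critique, refine."""
    response = actor.generate(question)              # actor drafts an answer
    feedback = critic.critique(question, response)   # critic provides feedback
    return actor.refine(question, response, feedback)  # actor revises
```

In Critique-RL, the reward for the critic's RL update is derived partly from whether this refined output improves on the draft, alongside the direct rule-based signal from stage I.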