Teaching Language Models to Critique via Reinforcement Learning
Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong
2025-02-12
Summary
This paper introduces CTRL, a method that teaches AI systems to critique and improve their own work, especially when writing computer code. The goal is to make these systems better at fixing mistakes and refining their outputs without needing human help.
What's the problem?
AI models often struggle to improve their own work because they lack the ability to give useful feedback on what went wrong or how to fix it. This limits their ability to get better over time, especially in complex tasks like coding.
What's the solution?
The researchers developed CTRL (Critic Training via Reinforcement Learning), a framework that trains an AI model to act as a critic. This critic reviews another model's initial attempt at solving a problem and provides feedback to guide a revision. The critic is trained with reinforcement learning, rewarded according to how much its feedback improves the revised solutions. Tested on coding tasks, the system showed significant gains in accuracy and in the ability to fix errors through multiple rounds of critique and revision.
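The critique-revision loop described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the generator and critic here are stubbed Python functions standing in for LLM calls, and all names (`generate`, `critique`, `critique_revision`) are hypothetical.

```python
# Toy sketch of an iterative critique-revision loop (illustrative names,
# not the paper's API). A fixed generator proposes a solution, a trained
# critic judges it and suggests a fix, and the generator revises.

def generate(task, feedback=None):
    # Stand-in for the fixed generator LLM.
    if feedback is None:
        return "def add(a, b): return a - b"   # buggy first attempt
    return "def add(a, b): return a + b"       # revision guided by feedback

def critique(task, solution):
    # Stand-in for the CTRL-trained critic: a judgment plus a suggestion.
    if "a - b" in solution:
        return {"correct": False, "suggestion": "use + instead of -"}
    return {"correct": True, "suggestion": None}

def critique_revision(task, max_rounds=3):
    # Alternate critique and revision until the critic is satisfied.
    solution = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, solution)
        if feedback["correct"]:
            break
        solution = generate(task, feedback=feedback["suggestion"])
    return solution

print(critique_revision("implement add(a, b)"))
```

In the real system both stubs would be LLM calls; the point of the sketch is the control flow, in which the critic's feedback is the only signal driving the revision.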
Why it matters?
This matters because it allows AI systems to become more self-sufficient and reliable, especially in tasks that require precision, like programming. By teaching AI to critique itself, we can create smarter systems that improve over time without constant human supervision, which could be useful in many fields beyond coding.
Abstract
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.
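The training signal in the abstract, feedback that "maximizes correction performance" without human supervision, can be sketched as a simple reward: the critic is rewarded when its feedback turns a failing solution into one that passes unit tests. The function names below are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of a correction-based reward for critic training
# (illustrative, not the paper's implementation). Unit tests, rather than
# human labels, decide whether a revision improved the solution.

def passes_tests(solution, tests):
    # Execute a candidate solution and run the hidden unit tests on it.
    ns = {}
    exec(solution, ns)
    return all(t(ns) for t in tests)

def critic_reward(initial, revised, tests):
    # +1 if the critic's feedback fixed a failing solution,
    # -1 if it broke a passing one, 0 otherwise.
    return float(passes_tests(revised, tests)) - float(passes_tests(initial, tests))

tests = [lambda ns: ns["add"](2, 3) == 5]
r = critic_reward("def add(a, b): return a - b",
                  "def add(a, b): return a + b",
                  tests)
print(r)  # 1.0
```

Since the generator is held fixed, a reward of this shape attributes any pass-rate improvement to the critic's feedback, which is what lets the critic be trained by reinforcement learning alone.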