LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang
2025-09-03
Summary
This research challenges the common practice in vision-language models of keeping the component that judges answers (the critic) separate from the component that generates them (the policy). It shows that a single model can be trained to do both well, improving performance on evaluation and generation alike.
What's the problem?
Vision-language systems typically use two separate models: a 'policy' that generates responses to images and text, and a 'critic' that evaluates how good those responses are. The critic is only ever used for judging, never for producing responses itself. This limits potential improvement because the critic's knowledge of what makes an answer good is never fed directly back into the answer-generating model; the system doesn't learn from its own evaluations in a direct way.
What's the solution?
The researchers took datasets normally used to train critics (data labeling which of two answers is preferred) and used them to train a base generative model directly with reinforcement learning, turning the preference labels into reward signals that can be verified automatically (one possible construction is sketched below). This produced LLaVA-Critic-R1, a single model that can both judge and generate responses. Applying the same recipe to existing strong reasoning VLMs yielded LLaVA-Critic-R1+, which pushes policy performance further without sacrificing critic quality. Finally, they showed that letting the model critique its own candidate answers at test time improves results without any additional training.
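One concrete way to read "verifiable training signals" out of preference data is to pose each preference pair as a judging task and reward the model only when its verdict matches the human label. The sketch below is illustrative, not the authors' released code; the field names (question, response_a, response_b, preferred) and the prompt/verdict format are assumptions.

```python
# Illustrative sketch: turning a preference-labeled critic example into a
# verifiable RL reward. Field names and the verdict format are assumptions,
# not the authors' actual data schema.

JUDGE_PROMPT = (
    "Question about the image: {question}\n"
    "Answer A: {response_a}\n"
    "Answer B: {response_b}\n"
    "Reason step by step, then finish with exactly 'Final verdict: A' or "
    "'Final verdict: B'."
)

def build_judge_prompt(example: dict) -> str:
    """Format one preference-labeled example as a judging task for the model."""
    return JUDGE_PROMPT.format(
        question=example["question"],
        response_a=example["response_a"],
        response_b=example["response_b"],
    )

def verifiable_reward(model_output: str, example: dict) -> float:
    """Reward 1.0 only if the model's final verdict matches the human preference
    label ('A' or 'B'); anything else, including a missing verdict, gets 0.0."""
    text = model_output.strip()
    if text.endswith("Final verdict: A"):
        verdict = "A"
    elif text.endswith("Final verdict: B"):
        verdict = "B"
    else:
        verdict = None
    return 1.0 if verdict == example["preferred"] else 0.0
```

Because the reward is computed by comparing the model's verdict against the stored label rather than by a learned reward model, it is "verifiable" in the sense used in the abstract and can drop into a standard RL fine-tuning loop while the model still generates free-form reasoning before its verdict.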
Why it matters?
This work is important because it demonstrates that a single model can effectively handle both evaluation and generation in vision-language tasks. This simplifies the system, potentially making it easier to scale and improve. It also opens the door to creating models that can learn and refine themselves continuously, leading to more capable and self-improving AI systems that can better understand and reason about images and text.
Abstract
In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
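To make the "self-critique at test time" idea from the abstract concrete, here is a minimal sketch: the same model samples several candidate answers and then judges them pairwise, and the highest-ranked candidate is returned. The generate and judge callables are placeholders for the model API, and the round-robin tournament is one simple selection rule, not necessarily the paper's exact procedure.

```python
from typing import Callable, List

def self_critique_decode(
    question: str,
    generate: Callable[[str], str],          # samples one candidate answer
    judge: Callable[[str, str, str], int],   # returns 0 if the first candidate wins, else 1
    num_candidates: int = 4,
) -> str:
    """Best-of-N decoding where the model's own critic ability picks the winner."""
    candidates: List[str] = [generate(question) for _ in range(num_candidates)]
    wins = [0] * num_candidates
    # Round-robin: every candidate is compared against every other one.
    for i in range(num_candidates):
        for j in range(i + 1, num_candidates):
            winner = judge(question, candidates[i], candidates[j])
            wins[i if winner == 0 else j] += 1
    # Return the candidate with the most pairwise wins.
    return candidates[max(range(num_candidates), key=wins.__getitem__)]
```

This kind of selection uses no extra training; it only spends more inference-time compute, which is consistent with the abstract's claim that the enhanced critic ability improves results at test time.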