GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
2025-06-25
Summary
This paper introduces GRPO-CARE, a new reinforcement learning method that helps multimodal large language models better understand videos by improving both the correctness of their answers and the logical consistency of the reasoning behind them.
What's the problem?
The problem is that earlier reinforcement learning methods focused mostly on getting the right final answers, while the reasoning steps leading to those answers were often inconsistent or illogical, especially when models had to analyze complex videos.
What's the solution?
The researchers designed GRPO-CARE to use two types of rewards during training: a base reward for correct answers and an adaptive bonus that encourages logically consistent reasoning steps. They also created a new benchmark, SEED-Bench-R1, to test models on challenging video understanding tasks. A rough sketch of how such a two-part reward could feed into a GRPO-style update is shown below.
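To make the reward design concrete, here is a minimal sketch in Python of how a correctness reward plus a consistency bonus might be combined and then group-normalized, as in GRPO. This is an illustration under assumptions, not the paper's exact formulation: the function names, the `beta` weight, the correctness gating of the bonus, and the use of a reference-model likelihood as the consistency signal are all hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def combined_rewards(answers, reference_answer, consistency_scores, beta=0.5):
    """Sketch of a two-part reward for a group of sampled responses.

    answers            : final answers extracted from each sampled response
    reference_answer   : the ground-truth answer for the prompt
    consistency_scores : a proxy for reasoning-answer consistency, e.g. the
                         likelihood a reference model assigns to each answer
                         given its reasoning trace (hypothetical choice)
    beta               : weight of the consistency bonus (hypothetical value)
    """
    rewards = []
    for ans, score in zip(answers, consistency_scores):
        accuracy = 1.0 if ans == reference_answer else 0.0  # base correctness reward
        # Assumed gating: only correct answers earn the bonus, so the model
        # cannot collect consistency credit for confidently wrong reasoning.
        bonus = beta * score if accuracy > 0 else 0.0
        rewards.append(accuracy + bonus)
    return group_relative_advantages(rewards)

# Example: a group of 4 sampled responses to one video QA prompt
advantages = combined_rewards(
    answers=["B", "B", "C", "B"],
    reference_answer="B",
    consistency_scores=[0.9, 0.4, 0.8, 0.7],
)
print(advantages)  # correct *and* consistent samples get the largest advantage
```

The key idea the sketch tries to capture is that the consistency bonus reshapes the group-relative advantages: among equally correct samples, those whose reasoning better supports their answer are pushed higher.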
Why it matters?
This matters because it helps build AI systems that don’t just give the right answers but also explain their thinking clearly and sensibly, making them more trustworthy and useful for understanding complex, real-world information like videos.
Abstract
GRPO-CARE, a reinforcement learning framework that optimizes for both answer correctness and reasoning consistency, outperforms standard GRPO on SEED-Bench-R1, a new video understanding benchmark, improving both performance and logical coherence in multimodal large language models.