GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
2025-06-25
Summary
This paper introduces GRPO-CARE, a new reinforcement learning method that helps multimodal large language models better understand videos by improving both the correctness of their answers and the logical consistency of the reasoning behind them.
What's the problem?
The problem is that earlier reinforcement learning methods focused mostly on getting the right final answers, while the reasoning steps leading to those answers were often inconsistent or illogical, especially when models had to analyze complex videos.
What's the solution?
The researchers designed GRPO-CARE to use two types of rewards during training: a base reward for correct answers and an adaptive bonus that encourages logically consistent reasoning steps. They also created a new benchmark, SEED-Bench-R1, to test models on challenging video understanding tasks. A rough sketch of how such a two-part reward could feed into a GRPO-style update is shown below.
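To make the reward design concrete, here is a minimal sketch in Python of how a correctness reward plus a consistency bonus might be combined and then group-normalized, as in GRPO. This is an illustration under assumptions, not the paper's exact formulation: the function names, the `beta` weight, the correctness gating of the bonus, and the use of a reference-model likelihood as the consistency signal are all hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def combined_rewards(answers, reference_answer, consistency_scores, beta=0.5):
    """Sketch of a two-part reward for a group of sampled responses.

    answers            : final answers extracted from each sampled response
    reference_answer   : the ground-truth answer for the prompt
    consistency_scores : a proxy for reasoning-answer consistency, e.g. the
                         likelihood a reference model assigns to each answer
                         given its reasoning trace (hypothetical choice)
    beta               : weight of the consistency bonus (hypothetical value)
    """
    rewards = []
    for ans, score in zip(answers, consistency_scores):
        accuracy = 1.0 if ans == reference_answer else 0.0  # base correctness reward
        # Assumed gating: only correct answers earn the bonus, so the model
        # cannot collect consistency credit for confidently wrong reasoning.
        bonus = beta * score if accuracy > 0 else 0.0
        rewards.append(accuracy + bonus)
    return group_relative_advantages(rewards)

# Example: a group of 4 sampled responses to one video QA prompt
advantages = combined_rewards(
    answers=["B", "B", "C", "B"],
    reference_answer="B",
    consistency_scores=[0.9, 0.4, 0.8, 0.7],
)
print(advantages)  # correct *and* consistent samples get the largest advantage
```

The key idea the sketch tries to capture is that the consistency bonus reshapes the group-relative advantages: among equally correct samples, those whose reasoning better supports their answer are pushed higher.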
Why it matters?
This matters because it helps build AI systems that don’t just give the right answers but also explain their thinking clearly and sensibly, making them more trustworthy and useful for understanding complex, real-world information like videos.
Abstract
GRPO-CARE, a reinforcement learning framework that optimizes for both answer correctness and reasoning consistency, outperforms standard GRPO on SEED-Bench-R1, a new video understanding benchmark, improving both performance and logical coherence in multimodal large language models.