MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao
2025-03-11
Summary
This paper introduces MM-Eureka, an AI model that learns to solve visual problems from rewards and to rethink its answers, like a student checking their work after getting feedback.
What's the problem?
AI struggles with tasks that mix images and text (like solving math problems with diagrams) because existing methods don’t learn efficiently or reflect on mistakes the way humans do.
What's the solution?
MM-Eureka uses a reward system to teach AI through trial and error, encouraging it to look back at images and improve answers over time without needing tons of labeled examples.
Why does it matter?
This helps AI tackle real-world tasks like homework tutoring or medical imaging analysis more accurately and with less training data, saving time and resources.
Abstract
We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning. While rule-based RL has shown remarkable success in improving LLMs' reasoning abilities in text domains, its application to multimodal settings has remained challenging. Our work reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, including steady increases in accuracy reward and response length, and the emergence of reflection behaviors. We demonstrate that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. We open-source our complete pipeline to foster further research in this area. All code, models, and data are released at https://github.com/ModalMinds/MM-EUREKA.
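The abstract does not spell out how the rule-based reward is computed, but systems in the DeepSeek-R1 style typically score each rollout with simple string rules rather than a learned reward model. Below is a minimal, hypothetical sketch of such rules, assuming the final answer is wrapped in \boxed{...} and reasoning appears in <think>...</think> tags (both conventions are illustrative assumptions, not taken from the paper).

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Rule-based accuracy reward (sketch): 1.0 if the final answer,
    assumed to be wrapped in \\boxed{...}, exactly matches the ground
    truth after whitespace normalization, else 0.0."""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    """Rule-based format reward (sketch): a small bonus when the
    response shows its reasoning inside <think>...</think> tags.
    Tag names and the 0.5 bonus are illustrative choices."""
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL)
    return 0.5 if has_think else 0.0
```

Because the reward is a fixed rule rather than a trained model, it cannot be gamed by reward hacking in the usual sense and needs no labeled preference data, which is one reason rule-based RL is data-efficient.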