R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh

2025-03-10

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Summary

This paper describes a breakthrough in making AI models better at visual reasoning tasks using reinforcement learning, achieving an "aha moment" in which the model begins to show self-reflection and produce longer responses.

What's the problem?

While reinforcement learning with simple rewards worked well for text-based AI models, it was hard to replicate the same success with models that work with both text and images (multimodal models). Previous attempts often failed to achieve the same level of improvement in reasoning ability.

What's the solution?

The researchers took a smaller, 2-billion-parameter AI model that had not been instruction-tuned (non-SFT) and applied reinforcement learning to it using SAT-style visual questions. This led to a significant improvement on visual reasoning tasks: the model outperformed the original base model by about 30% and even slightly surpassed models trained with traditional supervised fine-tuning.
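The key ingredient in R1-Zero-style training is a simple rule-based reward rather than a learned reward model. As a rough illustration (the tag names, weights, and exact matching rules here are hypothetical, not taken from the paper), such a reward might combine a format check with an answer-correctness check:

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward in the spirit of R1-Zero training.

    +0.5 if the response follows a <think>...</think><answer>...</answer>
    format, and +1.0 if the extracted answer matches the ground truth.
    (Tags and weights are illustrative assumptions, not from the paper.)
    """
    reward = 0.0
    # Format reward: require think/answer tags in order.
    if re.fullmatch(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
                    response, flags=re.DOTALL):
        reward += 0.5
    # Accuracy reward: compare the answer span to the reference.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match and match.group(1).strip().lower() == ground_truth.strip().lower():
        reward += 1.0
    return reward

good = "<think>The cube is left of the sphere.</think><answer>left</answer>"
print(rule_based_reward(good, "left"))    # 1.5
print(rule_based_reward("left", "left"))  # 0.0 (correct word, wrong format)
```

Because the reward depends only on verifiable rules, the model is free to discover longer, self-reflective reasoning traces on its own, which is what produces the "aha moment" during training.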

Why it matters?

This matters because it shows we can make AI models better at understanding and reasoning about images without needing huge amounts of data or extremely large models. It could lead to more efficient and capable AI systems for tasks that involve both text and images, like describing complex scenes or solving visual puzzles. The researchers also shared their failed attempts, which helps other scientists understand the challenges and potentially find new solutions.

Abstract

Recently, DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by ~2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on an instruct model often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero