R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
2025-09-01
Summary
This paper introduces R-4B, a model that can both 'think' through problems step-by-step and answer simple questions directly, without unnecessary processing.
What's the problem?
Current AI models, specifically Multimodal Large Language Models, always try to reason step-by-step, even when a problem is simple and doesn't require that level of thought. This is inefficient and wastes computing power. Essentially, they're overthinking easy questions.
What's the solution?
The researchers created R-4B, a model that learns *when* to think and when to answer directly. They trained it in two stages: first 'bi-mode annealing', which gives the model both thinking and non-thinking response capabilities, then 'Bi-mode Policy Optimization' (BPO), which improves its accuracy in deciding whether to activate step-by-step reasoning. Concretely, the model is first trained on a curated mix of problems containing both thinking-mode and non-thinking-mode answers, then refined to actively choose between thinking and not thinking for each new question it receives.
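At inference time, this auto-thinking behavior amounts to a per-query mode decision. The sketch below is only an illustration of that idea; `decide_mode` and `generate` are hypothetical stand-ins, not the paper's actual interface.

```python
# Minimal sketch (assumed interface): the model first decides a mode for the
# query, then either produces a reasoning trace before answering or answers
# directly, skipping the thinking step for easy questions.

def auto_think_answer(model, query):
    if model.decide_mode(query) == "think":
        trace = model.generate(f"{query}\n<think>")  # step-by-step reasoning trace
        return model.generate(f"{query}\n{trace}")   # answer conditioned on the trace
    return model.generate(query)                     # direct answer, no thinking
```

The point of the design is that the mode decision is made by the model itself, per query, rather than by a fixed flag set at deployment time.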
Why it matters?
R-4B is important because it makes these AI models more efficient and powerful. It performs as well as, or even better than, other models on difficult tasks, but uses less computing power. It even matches the performance of much larger models on complex reasoning problems, making advanced AI more accessible and practical.
Abstract
Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
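The abstract's "both modes for each input query" step can be illustrated with a small sketch. This assumes standard GRPO-style advantage normalization applied over the combined group of thinking and non-thinking rollouts; `bpo_advantages` is a hypothetical helper, not the paper's implementation.

```python
# Sketch (assumed): group-relative advantages computed over the union of
# thinking-mode and non-thinking-mode rollouts, so both modes are scored
# against the same per-query baseline.
from statistics import mean, pstdev

def bpo_advantages(rewards_thinking, rewards_direct, eps=1e-6):
    """Normalize each rollout's reward by the combined group's mean and std."""
    group = rewards_thinking + rewards_direct
    mu, sigma = mean(group), pstdev(group)
    return [(r - mu) / (sigma + eps) for r in group]
```

Because the baseline is shared across modes, the mode that earns higher reward on a given query receives positive advantage, which is what pushes the policy toward thinking only when thinking actually helps.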