SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng
2025-11-07
Summary
This paper introduces SAIL-RL, a new method for improving how well large AI models that can understand both text and images (multimodal large language models) can reason and solve problems.
What's the problem?
Current AI models are often trained by simply being told whether their final answer is right or wrong. This doesn't teach them *how* to think through a problem, so a model can reason incorrectly and still get the right answer by chance. Also, these models tend to apply the same thinking strategy to every problem, which is inefficient: some problems need careful, step-by-step thought, while others are simple enough to answer directly.
What's the solution?
SAIL-RL uses a two-part reward system to train the AI. First, it rewards the *quality* of the AI’s reasoning process, checking if its steps are logical, factually correct, and lead to a consistent answer. Second, it rewards the AI for deciding *when* to think deeply and when to answer directly, adapting its strategy to the complexity of the problem. They tested this on a model called SAIL-VL2.
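The two-part reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the sub-scores (factuality, coherence, consistency), the difficulty signal, and the blending weight `alpha` are all assumed placeholders for whatever SAIL-RL's actual reward models produce.

```python
def thinking_reward(factuality: float, coherence: float, consistency: float) -> float:
    """Score reasoning quality as the mean of three sub-scores in [0, 1].
    In SAIL-RL these correspond to factual grounding, logical coherence,
    and answer consistency; here they are just illustrative floats."""
    return (factuality + coherence + consistency) / 3.0

def judging_reward(used_deep_thinking: bool, problem_is_complex: bool) -> float:
    """Reward matching the thinking mode to problem difficulty:
    deep reasoning for complex problems, direct answers for simple ones."""
    return 1.0 if used_deep_thinking == problem_is_complex else 0.0

def dual_reward(factuality: float, coherence: float, consistency: float,
                used_deep_thinking: bool, problem_is_complex: bool,
                alpha: float = 0.5) -> float:
    """Blend the two rewards; alpha is an assumed weighting, not from the paper."""
    return (alpha * thinking_reward(factuality, coherence, consistency)
            + (1 - alpha) * judging_reward(used_deep_thinking, problem_is_complex))

# Example: sound reasoning, but overthinking a simple question
# forfeits the judging component and halves the total reward.
r = dual_reward(0.9, 1.0, 0.8, used_deep_thinking=True, problem_is_complex=False)
```

The key point the sketch captures is that a correct, well-reasoned answer still scores poorly if the model chose the wrong thinking mode for the problem's difficulty.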
Why does it matter?
This research matters because it makes AI models more reliable and trustworthy. By teaching them to reason soundly and to match their thinking effort to the problem, SAIL-RL substantially reduces errors and hallucinations (made-up content), allowing these models to perform as well as, or better than, some of the best commercial AI systems, such as GPT-4o.
Abstract
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.