Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou

2025-05-23

Summary

This paper introduces a new way to train models that understand both pictures and words, helping them decide when a question really needs step-by-step reasoning and when it doesn't.

What's the problem?

The problem is that current vision-language models tend to produce long reasoning for every question, even simple ones that don't need it, which wastes time and computing power.

What's the solution?

To solve this, the researchers created a two-stage training method called TON. First, they use supervised fine-tuning with a technique called thought dropout, which randomly replaces reasoning traces with empty ones so the model learns that skipping the reasoning step is a valid option. Then, they use a reinforcement learning method called Group Relative Policy Optimization (GRPO) to teach the model to reason only when it actually helps.
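The thought-dropout idea from the first stage can be sketched in a few lines: with some probability, a training example's reasoning trace is swapped for an empty thought, so the model sees both "think" and "don't think" examples. This is a minimal illustrative sketch, not the paper's implementation; the field names, the dropout probability, and the empty-thought format are assumptions.

```python
import random

# Assumed placeholder for a skipped reasoning step (format is an assumption).
EMPTY_THOUGHT = "<think>\n\n</think>"

def thought_dropout(sample: dict, p: float = 0.5) -> dict:
    """With probability p, replace the sample's reasoning trace with an
    empty thought, so supervised fine-tuning teaches the model that
    skipping reasoning is an allowed behavior."""
    if random.random() < p:
        return {**sample, "thought": EMPTY_THOUGHT}
    return sample

# Tiny demo batch; p=1.0 forces dropout so the effect is visible.
batch = [
    {
        "question": "What color is the sky?",
        "thought": "<think>The sky scatters blue light most strongly.</think>",
        "answer": "blue",
    },
]
dropped = [thought_dropout(s, p=1.0) for s in batch]
```

In real training, a moderate probability (e.g. 0.5) would mix reasoning and non-reasoning examples, leaving the later GRPO stage to reward the model for choosing correctly between the two modes.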

Why it matters?

This matters because it makes these smart models faster and more efficient without making them less accurate, which means they can be used more easily in real-world situations where speed and resources are important.

Abstract

TON, a two-stage training strategy combining supervised fine-tuning with thought dropout and Group Relative Policy Optimization, reduces unnecessary reasoning steps in vision-language models without sacrificing performance.