Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen

2025-07-23


Summary

This paper introduces SOPHIA, a training method that helps large vision-language models (AI models that work with both images and text) reason more carefully and deeply before giving answers.

What's the problem?

Current models often answer quickly without enough slow, careful reasoning, and it is hard to train them to slow down their thinking because of how they learn from visual and language data.

What's the solution?

The researchers created a semi-off-policy reinforcement learning approach in which the model combines fast visual understanding with slower, deliberate reasoning. The model is rewarded according to how well it reasons over time, and it learns from these rewards to improve its slow-thinking ability.
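To make the idea of rewarding reasoning "over time" concrete, here is a minimal sketch, not the paper's algorithm. It assumes that only the final answer receives a reward and that this reward is passed backward to earlier reasoning steps with a discount factor. The function name `propagate_outcome_reward` and the parameter `gamma` are hypothetical and chosen only for illustration.

```python
from typing import List


def propagate_outcome_reward(
    num_steps: int, outcome_reward: float, gamma: float = 0.9
) -> List[float]:
    """Spread a single final-answer reward backward over earlier steps.

    Illustrative assumption: only the last step is rewarded directly, and
    each earlier step receives a discounted share of that reward.
    """
    if num_steps <= 0:
        return []
    returns = [0.0] * num_steps
    running = 0.0
    # Walk from the last reasoning step back to the first.
    for t in reversed(range(num_steps)):
        reward_t = outcome_reward if t == num_steps - 1 else 0.0
        running = reward_t + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three reasoning steps, correct final answer (reward 1.0), gamma 0.5.
    print(propagate_outcome_reward(3, 1.0, gamma=0.5))
```

In this toy setup, steps closer to the final answer receive a larger share of the credit, which is one common way to turn a single outcome reward into a learning signal for a whole reasoning chain.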

Why does it matter?

This matters because teaching AI to slow down and reason more carefully helps it solve complicated tasks more accurately, making it more reliable on problems that involve both images and language.

Abstract

SOPHIA, a semi-off-policy reinforcement learning approach, enhances large vision-language models with slow-thinking reasoning, improving performance on multimodal reasoning tasks.
