
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, Jiajun Zhang

2025-09-16


Summary

This paper focuses on improving how well vision-language models, which are AI systems that can 'look' at images and understand text, can actually *reason* about what they see. It builds on recent progress in getting text-based AI to think through problems step-by-step.

What's the problem?

Current vision-language models struggle with 'slow thinking' when dealing with images. When asked to explain their reasoning, they quickly stop paying attention to the actual visual details of the image, leading to inaccurate or poorly supported conclusions. Essentially, they don't consistently 'check their work' by looking back at the image as they explain their thought process.
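One way to picture this failure mode, assuming you can read out the model's attention weights during generation, is to track what fraction of each generated token's attention lands on the image tokens. The helper below is a rough sketch of such a measurement; the tensor shapes and names are illustrative and are not the paper's actual analysis code.

```python
import torch

def visual_attention_curve(attentions, image_token_mask):
    """
    Fraction of attention each generated token places on image tokens.

    attentions:       [num_generated_tokens, seq_len] attention weights,
                      e.g. averaged over layers and heads.
    image_token_mask: [seq_len] boolean mask marking which positions are
                      image (visual) tokens.

    A curve that decays quickly toward zero means the model effectively
    stops consulting the image as its explanation gets longer.
    """
    visual = attentions[:, image_token_mask].sum(dim=-1)   # [T]
    total = attentions.sum(dim=-1).clamp(min=1e-8)         # [T]
    return (visual / total).tolist()                       # ratio per generation step
```

Plotting this ratio against the position of each generated token would make the "diminishing attention" pattern the authors describe directly visible.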

What's the solution?

The researchers created a new model called Reflection-V and trained it in two stages. First, they built training data using an agent setup in which a vision-language model and a text-only reasoning model work together, producing reasoning traces that explicitly tie each step back to visual elements; this data provides the "cold-start" phase of training. Second, during reinforcement learning they used a reward based on how much attention the model pays to relevant parts of the image while generating its explanation. Together, these stages teach the model to keep referring back to the visual information as it reasons; a rough sketch of such an attention-based reward follows below.
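The paper does not spell out the exact reward formula here, so the following is a minimal sketch under the assumption that the reward rescales the average attention mass the response places on image tokens and mixes it with a task-accuracy reward; all names and the mixing coefficient `alpha` are hypothetical.

```python
import torch

def visual_attention_reward(attentions, image_token_mask, floor=0.1):
    """
    Illustrative reward: average attention mass that generated tokens place
    on image tokens, rescaled so responses that keep "looking back" at the
    image score higher.

    attentions:       [num_generated_tokens, seq_len] attention weights
                      (e.g. averaged over layers and heads).
    image_token_mask: [seq_len] boolean mask marking image tokens.
    """
    # Fraction of each generated token's attention that lands on the image.
    visual_mass = attentions[:, image_token_mask].sum(dim=-1)  # [T]
    score = visual_mass.mean().item()
    # Rescale into [0, 1] so it can be combined with a task reward.
    return max(0.0, min(1.0, (score - floor) / (1.0 - floor)))

def total_reward(answer_correct, attentions, image_token_mask, alpha=0.5):
    # Hypothetical mix of task-accuracy reward and visual-attention reward.
    task_r = 1.0 if answer_correct else 0.0
    vis_r = visual_attention_reward(attentions, image_token_mask)
    return task_r + alpha * vis_r
```

The key design idea, as described in the paper, is that the RL signal rewards reasoning that stays grounded in the image rather than only rewarding a correct final answer.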

Why it matters?

This work is important because it addresses a fundamental limitation of current vision-language models. By improving their ability to visually reflect on their reasoning, these models become more reliable and trustworthy, especially in tasks that require careful analysis of images, like understanding complex scenes or answering detailed questions about visual content.

Abstract

Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: Effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL). Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Secondly, a visual-attention-based reward model is employed during RL to encourage reasoning based on visual information. Therefore, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.
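The abstract's "agent that interacts between VLMs and reasoning LLMs" could look roughly like the loop below: a reasoning LLM drives step-by-step reasoning and, whenever it needs visual evidence, queries a VLM about the image, and the interleaved trace is saved as cold-start training data. This is a sketch only; `reasoning_llm`, `vlm`, and the LOOK/ANSWER tag convention are assumptions, not the paper's actual interface.

```python
def build_vision_centered_trace(question, image, reasoning_llm, vlm, max_steps=8):
    """Collect a reasoning trace that interleaves text reasoning with visual look-backs."""
    trace = []
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = reasoning_llm.generate(context)        # next reasoning step or a visual query
        if step.startswith("LOOK:"):                  # the LLM asks to re-inspect the image
            visual_query = step[len("LOOK:"):].strip()
            observation = vlm.answer(image, visual_query)
            trace.append({"type": "visual_check",
                          "query": visual_query,
                          "observation": observation})
            context += f"{step}\nObservation: {observation}\n"
        else:
            trace.append({"type": "reasoning", "text": step})
            context += step + "\n"
            if step.startswith("ANSWER:"):            # stop once a final answer is produced
                break
    return trace
```

Traces built this way would naturally contain the "look again" pattern the model is later rewarded for during RL.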