Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao, Youbin Wu
2025-12-03
Summary
This research investigates how different ways of providing step-by-step reasoning examples to visual AI models affect their ability to solve new, unseen problems involving visual information.
What's the problem?
Vision-language models are getting better at tasks that require reasoning, often by being shown examples of how to think through a problem step by step – so-called 'Chain-of-Thought' (CoT). However, it is not well understood *why* certain CoT formats work better than others, or whether they help the model learn to reason in general rather than just memorize specific examples. The researchers wanted to find out whether longer, more detailed reasoning steps – or steps that include images – truly lead to better problem-solving on unseen tasks.
What's the solution?
The researchers built a maze-solving benchmark where the rules are entirely visual and the difficulty can be adjusted by changing the grid size. They trained a vision-language model (Qwen2.5-VL-7B) with three different CoT formats: one using plain-text explanations, one tracing the solution path with spatial coordinates, and one showing step-by-step image manipulations of the maze. They then compared how well each format helped the model solve mazes of different sizes. Longer and more visual CoT helped the model learn *faster*, but did not raise its final performance ceiling. Surprisingly, the *shortest*, most focused reasoning traces – containing only the essential path coordinates – generalized best to new, larger mazes. The researchers confirmed these findings on other vision-centric tasks as well.
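To make the contrast between formats concrete, here is a minimal, hypothetical sketch of what two of the three CoT styles might look like for a toy maze path. The helper names and string formats are illustrative assumptions, not the paper's actual data pipeline; the point is only that a coordinate-only "grounding" trace is far shorter than a verbose language trace over the same path.

```python
# Hypothetical sketch of two CoT supervision formats for a toy maze path.
# (The Visual CoT format, which manipulates images, is omitted here.)

def grounding_cot(path):
    """Grounding CoT: only the coordinate trajectory of the solution."""
    return " -> ".join(f"({r},{c})" for r, c in path)

def language_cot(path):
    """Language CoT: a verbose natural-language description of each move."""
    moves = {(0, 1): "right", (0, -1): "left", (1, 0): "down", (-1, 0): "up"}
    steps = []
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        direction = moves[(r2 - r1, c2 - c1)]
        steps.append(f"From ({r1},{c1}) move {direction} to ({r2},{c2}).")
    return " ".join(steps)

# A solution path through a small grid maze, as (row, col) cells.
path = [(0, 0), (0, 1), (1, 1), (2, 1)]
print(grounding_cot(path))  # (0,0) -> (0,1) -> (1,1) -> (2,1)
print(language_cot(path))
```

Even on this tiny example, the grounding trace is a single line of coordinates, while the language trace spells out every move, illustrating the length gap the study exploits.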
Why it matters?
This work shows that when teaching AI to reason visually, less can actually be more. It challenges the idea that more detailed explanations are always better and suggests that focusing on the core, essential reasoning steps is key to building AI that can truly generalize and solve new problems. This provides practical advice for creating better training data for visual reasoning AI.
Abstract
We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To evaluate this systematically, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all intermediate steps can be generated automatically. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.