Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi
2025-11-11
Summary
This paper introduces a large new dataset designed to help AI models reason over images, going beyond the visual math problems that most existing datasets focus on. It also studies how best to train these image-understanding AI systems.
What's the problem?
Currently, a lot of progress in teaching AI to reason with images relies on datasets and methods that aren't publicly available. This makes it hard for researchers to understand *how* to build good reasoning datasets, especially for complex tasks that aren't simply visual math problems. There's a need for a transparent and scalable way to create these datasets.
What's the solution?
The researchers developed a two-stage process to generate over a million image-based reasoning questions: first they generated a large volume of questions (scale), then they made those questions harder and more compositional (complexity). They used powerful AI models, both vision-language models that understand images and reasoning models that work purely in text, to produce detailed step-by-step explanations (like showing your work in math) for each question, as sketched below. They then used this dataset to fine-tune an AI model called Qwen2.5-VL-7B.
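A minimal sketch of that two-stage flow, under the assumptions of this summary only: the function names, prompts, and string placeholders below are hypothetical stand-ins for the paper's actual VLM and reasoning-LLM calls, shown purely to make the pipeline's shape concrete.

```python
# Illustrative sketch of the two-stage synthesis flow described above.
# All model calls and prompts are hypothetical placeholders, not the paper's actual code.

from dataclasses import dataclass


@dataclass
class Example:
    image_id: str
    question: str
    answer: str
    reasoning_trace: str = ""


def generate_seed_questions(image_id: str, n: int):
    """Stage 1 (scale): draft many simple vision-centric questions per image."""
    # Placeholder: a real pipeline would prompt a VLM with the actual image here.
    return [Example(image_id, f"seed question {i} about {image_id}", f"answer {i}")
            for i in range(n)]


def compose_harder_question(a: Example, b: Example) -> Example:
    """Stage 2 (complexity): combine simpler questions into a harder, compositional one."""
    return Example(
        image_id=a.image_id,
        question=f"Considering both ({a.question}) and ({b.question}), what follows?",
        answer=f"{a.answer}; {b.answer}",
    )


def synthesize_trace(ex: Example) -> Example:
    """Two-stage trace synthesis: a VLM grounds the image in text, then a
    reasoning LLM writes a long chain-of-thought over that grounding."""
    grounding = f"[VLM description of {ex.image_id}]"                 # placeholder VLM call
    ex.reasoning_trace = f"[reasoning-LLM CoT using {grounding} -> {ex.answer}]"  # placeholder LLM call
    return ex


if __name__ == "__main__":
    seeds = generate_seed_questions("img_0001", n=4)
    hard = [compose_harder_question(seeds[i], seeds[i + 1]) for i in range(len(seeds) - 1)]
    dataset = [synthesize_trace(ex) for ex in seeds + hard]
    print(f"{len(dataset)} examples, e.g.: {dataset[-1].question}")
```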
Why does it matter?
This work is important because the fine-tuned model, Qwen2.5-VL-7B trained on the new dataset, outperformed other models built on open data and even rivaled some trained on closed, non-public datasets. Surprisingly, the skills learned from reasoning about images also transferred, improving performance on text-only and audio reasoning tasks. The research additionally offers practical guidance for training these systems, such as the importance of high-quality training data and of a staged training approach.
Abstract
Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.
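To make the abstract's post-training findings concrete, here is a minimal sketch of the staged schedule they suggest: SFT on high-quality traces first, then offline preference optimization, with online RL as an optional final stage. The stage functions and data placeholders are assumptions for illustration, not the paper's training code.

```python
# Minimal sketch of a staged post-training schedule consistent with the findings above.
# Stage functions are hypothetical placeholders; no real training library is invoked.

def supervised_finetune(model: str, sft_data: list) -> str:
    """Stage 1: SFT on high-quality CoT traces (reported as essential before RL)."""
    print(f"SFT on {len(sft_data)} reasoning traces")
    return model + "+sft"


def offline_rl_stage(model: str, preference_data: list) -> str:
    """Stage 2: staged offline RL (e.g., preference optimization on chosen/rejected pairs)."""
    print(f"Offline RL on {len(preference_data)} preference pairs")
    return model + "+offline_rl"


def online_rl_stage(model: str, prompts: list) -> str:
    """Optional stage 3: online RL on instruction prompts; per the paper's analysis,
    staged offline RL can match its performance at lower compute cost."""
    print(f"Online RL on {len(prompts)} prompts")
    return model + "+online_rl"


if __name__ == "__main__":
    model = "qwen2.5-vl-7b"  # base checkpoint name, for illustration only
    model = supervised_finetune(model, sft_data=["trace"] * 3)
    model = offline_rl_stage(model, preference_data=[("chosen", "rejected")] * 2)
    # model = online_rl_stage(model, prompts=["prompt"])  # optional, more compute-intensive
    print("final:", model)
```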