Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
2025-11-07
Summary
This paper introduces a new way to help AI models, specifically large language models and vision-language models, reason better by letting them 'think with video'. Current methods treat text and images separately, but this research shows that videos generated by models like Sora-2 can support more unified and powerful reasoning.
What's the problem?
Existing AI reasoning paradigms, like 'Thinking with Text' and 'Thinking with Images', have limitations. Images only capture a single moment in time, making it hard to represent processes that change or unfold over time. Also, treating text and vision as separate modalities prevents the AI from learning how they connect and work together, which hinders unified multimodal understanding and generation.
What's the solution?
The researchers developed a new approach called 'Thinking with Video'. A video generation model, Sora-2, produces videos that serve as the medium through which the AI reasons about a problem. To test this, they created a benchmark called VideoThinkBench, which includes vision-centric tasks (like puzzles that must be solved by looking at them) and text-centric tasks (like math problems). They then evaluated Sora-2 on these tasks and found it performed well, in some cases matching or even surpassing existing vision-language models.
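To make the setup concrete, here is a minimal sketch of what a 'Thinking with Video' evaluation step could look like. The paper does not publish an API, so generate_video, read_answer_from_frame, and think_with_video below are hypothetical placeholders for whatever video-generation and answer-extraction interfaces are actually used (e.g., Sora-2 plus an OCR or VLM step that reads the final frame); this is not the authors' implementation.

```python
# Hypothetical sketch of a "Thinking with Video" evaluation step.
# None of these functions correspond to a real Sora-2 API; they are
# placeholders for whichever generation/extraction backends are available.

def generate_video(prompt: str) -> list:
    """Call a video generation model (e.g., Sora-2) and return its frames."""
    raise NotImplementedError("plug in a real video-generation backend here")

def read_answer_from_frame(frame) -> str:
    """Extract the final answer shown in a frame, e.g., via OCR or a VLM."""
    raise NotImplementedError("plug in a real answer-extraction step here")

def think_with_video(question: str) -> str:
    """Generate a reasoning video for `question` and read the answer off the last frame."""
    prompt = (
        "Solve the following problem step by step on screen, "
        f"and display the final answer in the last frame: {question}"
    )
    frames = generate_video(prompt)
    return read_answer_from_frame(frames[-1])
```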
Why it matters?
This research is important because it suggests that video could be a key to building AI that can truly understand the world around it. By allowing AI to 'think with video', we can move beyond separate understanding of text and images towards a more unified and powerful form of artificial intelligence capable of complex reasoning and content creation.
Abstract
"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that the video generation model is the potential unified multimodal understanding and generation model, positions "thinking with video" as a unified multimodal reasoning paradigm.