VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra
2026-01-09
Summary
This paper investigates whether multimodal language models *always* need to show their work, i.e., use 'chain-of-thought' reasoning, when answering questions about videos. It turns out they often don't, and a more selective approach can be both more accurate and faster.
What's the problem?
Researchers have been using a technique called 'chain-of-thought', where the model explains its reasoning step by step before answering a question about a video. However, it wasn't clear whether this actually *helps* the model, or whether it just makes responses longer and slower. The paper finds that for video models trained with reinforcement learning, answering the question directly works as well as, or even better than, walking through all the reasoning steps, even though reasoning consumes more compute.
What's the solution?
The researchers created a new framework called VideoAuto-R1 that follows a 'reason-when-necessary' strategy. The system first tries to answer the question directly, then *decides*, based on how confident it is in that initial answer, whether it needs to think through the problem step by step. During training, the model follows a 'Thinking Once, Answering Twice' recipe: it answers, reasons, and then outputs a reviewed answer, receiving reward feedback on both answers. Skipping reasoning when it isn't needed makes the model markedly more efficient.
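To make the inference-time gate concrete, here is a minimal sketch of confidence-gated answering. It assumes a hypothetical `generate` callable returning an answer plus its token log-probabilities; the mean-token-probability confidence score, the 0.9 threshold, and the "think step by step" prompt suffix are illustrative choices, not details confirmed by the paper.

```python
import math
from typing import Callable, List, Tuple

def answer_with_optional_reasoning(
    generate: Callable[[str], Tuple[str, List[float]]],
    question: str,
    threshold: float = 0.9,  # hypothetical confidence cutoff
) -> str:
    # Step 1: answer directly, with no chain-of-thought.
    answer, logprobs = generate(question)

    # Confidence of the initial answer: geometric-mean token probability
    # (a common proxy; the paper's exact score may differ).
    confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))

    # Step 2: only if the model is unsure, trigger explicit reasoning
    # and return the reviewed answer instead.
    if confidence < threshold:
        reviewed, _ = generate(
            question + "\nThink step by step, then give a final answer."
        )
        return reviewed
    return answer

# Toy usage with a stub generator that is always uncertain (prob 0.5),
# so the reasoning branch fires:
stub = lambda prompt: ("B", [math.log(0.5)] * 3)
print(answer_with_optional_reasoning(stub, "What color is the car?"))
```

The key design point is that a single scalar confidence check decides whether the expensive reasoning pass runs at all, which is where the token savings come from.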
Why does it matter?
This work is important because it shows that we don't always need to force AI models to explain their thinking. By letting the model decide when reasoning is truly needed, video understanding systems can become faster and more accurate while using fewer resources. It suggests that complex reasoning is valuable, but not for every single task: sometimes a quick, direct answer is best.
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
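Since the abstract says both answers are supervised via verifiable rewards, here is a hedged sketch of what that dual-answer reward could look like for a rollout of the form initial answer, reasoning, reviewed answer. The exact-match verifier and the equal weights are assumptions for illustration; the paper's actual reward design may differ.

```python
def verifiable_reward(initial: str, reviewed: str, gold: str,
                      w_initial: float = 0.5,   # illustrative weight
                      w_reviewed: float = 0.5) -> float:
    """Score both the initial and the reviewed answer against the ground
    truth, so the model is rewarded for answering well both before and
    after reasoning."""
    match = lambda pred: float(pred.strip().lower() == gold.strip().lower())
    return w_initial * match(initial) + w_reviewed * match(reviewed)

# Example: a wrong first guess corrected after reasoning earns partial credit.
print(verifiable_reward("A", "C", "C"))  # -> 0.5
```

Rewarding both answers pushes the model to be accurate even when it later skips the reasoning pass at inference time, which is what makes the direct-answer path trustworthy.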