Demystifying Video Reasoning
Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
2026-03-18
Summary
This paper investigates how video generation models, specifically those using a technique called diffusion, can surprisingly perform reasoning tasks. It challenges the idea that this reasoning happens by thinking step-by-step *through* the frames of the video, and instead proposes a different explanation.
What's the problem?
Researchers noticed that video generation models could solve problems, but assumed this was because the models were essentially 'watching' a solution unfold frame by frame, like a chain of events. The problem was that no one really understood *how* this reasoning was happening within the model, and whether the 'frame-by-frame' explanation was actually correct. It was a black box.
What's the solution?
The researchers found that the reasoning actually happens during the *creation* of each frame, within the diffusion denoising process itself. The model explores many candidate solutions in the early denoising steps, then progressively refines them until it settles on a final answer – a 'chain of steps' rather than a chain of frames. They also identified supporting behaviors: a kind of working memory that keeps information available across steps, self-correction that recovers from wrong intermediate answers, and a 'perceive first, act later' pattern in which the model understands the scene before manipulating it. Looking inside the model, they found that different layers specialize – early layers handle dense perception, middle layers carry out the reasoning, and later layers consolidate the result. Finally, they showed a simple, training-free way to improve reasoning: combine the results of multiple runs of the same model started from different random seeds.
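The 'chain of steps' idea can be illustrated with a toy probing loop: decode the latent after every denoising step and watch the intermediate guesses converge toward the final answer. The `denoise_step` and `decode` functions below are hypothetical stand-ins for a real video diffusion model's components, not the paper's actual method – a minimal sketch of the probing setup only.

```python
# Toy sketch of probing intermediate denoising steps (NumPy only).
# A real study would decode latents from an actual video diffusion model;
# here a linear update toward a fixed "answer" latent stands in for it.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])   # the "final answer" latent (illustrative)

def denoise_step(latent, t, num_steps):
    # Illustrative update: move part of the way toward the target,
    # mimicking how each diffusion step refines the latent.
    alpha = 1.0 / (num_steps - t)
    return latent + alpha * (target - latent)

def decode(latent):
    # Stand-in for the decoder: just rounds latents to readable values.
    return np.round(latent, 2)

num_steps = 10
latent = rng.normal(size=3)            # pure noise before the first step
trajectory = []
for t in range(num_steps):
    latent = denoise_step(latent, t, num_steps)
    trajectory.append(decode(latent))  # probe: decode the intermediate latent

# Early probes are far from the answer; later probes converge on it.
print(trajectory[0], trajectory[-1])
```

Probing each step this way is what distinguishes a Chain-of-Steps view (reasoning along denoising time) from a Chain-of-Frames view (reasoning along frame index).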
Why it matters?
Understanding how reasoning emerges in these video models is important because it could unlock a new way to build intelligent systems. Instead of explicitly programming reasoning abilities, we might be able to leverage the natural reasoning capabilities that already exist within these models, potentially leading to more powerful and flexible AI.
Abstract
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
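The training-free strategy mentioned in the abstract – ensembling latent trajectories from identical models run with different random seeds – can be sketched as follows. The toy `run_denoiser` below is a hypothetical stand-in (a noisy refinement loop, not the paper's model), and simple averaging of final latents is one plausible ensembling choice, assumed here for illustration.

```python
# Minimal sketch of seed ensembling: run the same (toy) denoiser from
# several random seeds and average the resulting latents.
import numpy as np

target = np.array([2.0, -1.0])        # illustrative "correct answer" latent

def run_denoiser(seed, num_steps=8):
    rng = np.random.default_rng(seed)
    latent = rng.normal(scale=3.0, size=2)
    for t in range(num_steps):
        # Noisy refinement toward the target at each denoising step.
        latent = latent + (target - latent) / (num_steps - t) \
                 + rng.normal(scale=0.1, size=2)
    return latent

seeds = [1, 2, 3, 4, 5]
runs = np.stack([run_denoiser(s) for s in seeds])
ensembled = runs.mean(axis=0)         # average latents across seeds

# In expectation, averaging cancels per-seed noise, so the ensemble
# tends to land closer to the target than a typical single run.
errors = np.linalg.norm(runs - target, axis=1)
print(np.linalg.norm(ensembled - target), errors.mean())
```

Because each seed's trajectory ends near the answer plus independent noise, averaging reduces the noise roughly by the square root of the number of seeds – the same intuition as any ensemble of independent estimators.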