Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
2025-10-31
Summary
This paper investigates how well a current video generation model, Veo-3, can 'think' and reason about what it sees in videos without being specifically trained to do so. These models are very good at *making* realistic videos, which suggests they might understand how the world works, and this study tests whether that is actually true.
What's the problem?
While video models can create impressive videos, it's unclear if they truly *understand* the visual world or if they're just good at mimicking it. The question is whether these models can be used to solve complex visual problems, like predicting what will happen next or understanding how objects interact, without any prior training on those specific tasks. There wasn't a good way to systematically test this kind of reasoning ability in video models.
What's the solution?
Researchers thoroughly tested Veo-3's reasoning skills across 12 different areas, including understanding space, geometry, physics, time, and even how things work in the real world. To make the testing fair and repeatable, they curated the evaluation data into a compact benchmark built specifically for this purpose, called MME-CoF. They looked at both what the model got right and where it struggled, focusing on how well it could connect events across multiple frames in a video, what they call 'Chain-of-Frame' (CoF) reasoning.
Why it matters?
The findings show that while these video models are getting better at understanding simple, short-term visual situations, they still struggle with more complex reasoning, like predicting long-term consequences or following strict geometric constraints. This means they aren't ready to be used as standalone problem-solvers yet, but they *could* be helpful visual engines when paired with AI systems specifically designed for reasoning.
Abstract
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io