Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang
2025-11-19
Summary
This paper examines how well video generation models, treated as world simulators, can actually *reason* about the world, not just create realistic-looking videos.
What's the problem?
Current methods for testing video generation models focus on how good the videos *look* or how well they follow instructions, but they don't test whether the model understands the underlying physics or logic of what's happening in the video. Essentially, models can produce videos that seem smart, but we don't know if the AI is actually reasoning or just mimicking patterns. There's no good way to measure a video model's ability to plan, solve problems, or reason about abstract concepts visually.
What's the solution?
The researchers created a new benchmark called Gen-ViRe. It breaks 'reasoning' down into six cognitive dimensions, such as understanding how objects interact or planning a series of actions, and covers 24 subtasks designed to probe these skills. Each generated video is judged with a hybrid protocol that combines VLM-assisted automated scoring against detailed criteria with human review. The researchers then tested several state-of-the-art video models on this benchmark.
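To make the evaluation protocol more concrete, here is a minimal, hypothetical sketch of what a VLM-assisted scoring loop could look like. The names (`score_video`, `judge_fn`, `Criterion`) and the example rubric text are illustrative assumptions, not the paper's actual implementation; only the dimension names come from the abstract.

```python
# Hypothetical sketch of a Gen-ViRe-style hybrid evaluation loop (not the
# paper's released code): score one generated video against per-task criteria
# with a caller-supplied VLM judge, then aggregate per cognitive dimension.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Criterion:
    dimension: str    # e.g. "perceptual logic", "abstract planning" (names from the abstract)
    description: str  # detailed rubric text shown to the VLM judge


def score_video(
    frames: Sequence[bytes],                             # sampled frames of the generated video
    prompt: str,                                         # minimal task prompt given to the video model
    criteria: Sequence[Criterion],
    judge_fn: Callable[[Sequence[bytes], str], float],   # wraps your VLM; returns a score in [0, 1]
) -> dict[str, float]:
    """Return a mean score per cognitive dimension for one video."""
    per_dim: dict[str, list[float]] = {}
    for c in criteria:
        question = (
            f"Task prompt: {prompt}\n"
            f"Criterion: {c.description}\n"
            "Rate how well the frame sequence satisfies this criterion (0-1)."
        )
        per_dim.setdefault(c.dimension, []).append(judge_fn(frames, question))
    return {dim: mean(scores) for dim, scores in per_dim.items()}


if __name__ == "__main__":
    # Stand-in judge; a real setup would call a VLM here and, per the paper's
    # hybrid protocol, have humans review or calibrate the automated scores.
    dummy_frames = [b"frame0", b"frame1"]
    rubric = [
        Criterion("perceptual logic", "Objects collide and rebound consistently with rigid-body motion."),
        Criterion("abstract planning", "The sequence reaches the goal state via a coherent ordering of steps."),
    ]
    fake_judge = lambda frames, question: 0.5  # placeholder for an actual VLM call
    print(score_video(dummy_frames, "Stack the three blocks into a tower.", rubric, fake_judge))
```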
Why it matters?
This work is important because it provides a way to actually measure how well video models can reason, which is crucial for building AI that can truly understand and interact with the real world. The results show that even models that create visually impressive videos often struggle with basic reasoning tasks, highlighting areas where further research is needed to create more intelligent and capable AI systems.
Abstract
While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.