V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You

2025-11-21

Summary

This paper introduces a new way to test how well AI models understand and reason about videos, with a focus on models that *generate* videos, such as Veo-3.

What's the problem?

As AI models get better at making videos, it has become clear that they can sometimes appear to 'think' or understand things, but we lack good, consistent tests to check whether this reflects real reasoning or just lucky guesses. Existing tests aren't comprehensive enough to really dig into *how* these models process video information.

What's the solution?

The researchers created a benchmark called V-ReasonBench. This benchmark gives AI models a series of video-based challenges that test four specific reasoning skills: solving problems step-by-step, understanding how things are positioned in space, recognizing patterns, and understanding how objects move and interact physically. They used both computer-generated and real-life videos to make the tests diverse and reliable, and they tested six different state-of-the-art video AI models.
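To make the idea of "answer-verifiable" tasks concrete, here is a minimal sketch of what an evaluation loop for such a benchmark could look like. This is an illustration, not the actual V-ReasonBench code: the `Task` structure, `model.generate_video`, and `extract_answer` are all hypothetical names standing in for whatever the real harness uses.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical task record: each task carries a single ground-truth answer,
# which is what makes scoring reproducible and unambiguous.
@dataclass
class Task:
    dimension: str        # e.g. "structured", "spatial", "pattern", "physical"
    prompt: str           # the image/text prompt given to the video model
    expected_answer: str  # ground-truth answer used for exact-match checking

def evaluate(model, tasks, extract_answer):
    """Score a video model separately on each reasoning dimension.

    `model` and `extract_answer` are placeholders: a real harness would
    generate a video for each prompt and parse the model's answer out of
    the generated frames (the "Chain-of-Frames" idea from the abstract).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task in tasks:
        video = model.generate_video(task.prompt)  # hypothetical API call
        answer = extract_answer(video)             # hypothetical answer parser
        total[task.dimension] += 1
        if answer == task.expected_answer:
            correct[task.dimension] += 1
    # Per-dimension accuracy is what exposes the "dimension-wise differences"
    # the paper reports across the six evaluated models.
    return {dim: correct[dim] / total[dim] for dim in total}
```

The key design point this sketch tries to capture is that every task has one checkable answer, so scores can be computed automatically and compared across models without human judgment.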

Why it matters?

Having a reliable benchmark like V-ReasonBench is crucial for improving AI. It helps developers pinpoint exactly where their models struggle with reasoning, so they can build AI that is more trustworthy and reasons about the world through video in a more human-aligned way.

Abstract

Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.