
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen

2025-11-18

Summary

This paper introduces a new way to test how well AI models can *think* when creating videos, going beyond just making videos that *look* good.

What's the problem?

Current methods for judging video generation AI focus on how realistic and smooth the videos are, but they don't check whether the model actually understands what it is showing. Can it reason about cause and effect, or follow complex, multi-step instructions? Existing benchmarks don't push these models to demonstrate the kind of reasoning we see in large language models such as ChatGPT.

What's the solution?

The researchers created a benchmark called TiViBench, a hierarchical set of challenges designed to test specific kinds of reasoning in video generation, such as structural reasoning and search, spatial and visual pattern reasoning, symbolic and logical reasoning, and action planning, spanning 24 task scenarios across three difficulty levels. They evaluated current image-to-video models on it, including commercial systems like Sora 2 and Veo 3.1 as well as open-source alternatives. They also developed a test-time technique called VideoTPO: the model generates several candidate videos, a large language model analyzes them to identify strengths and weaknesses, and that self-analysis is used to improve the final result without any extra training, data, or reward models (a rough sketch of this idea appears below).
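
Although the paper's implementation details aren't given in this summary, the VideoTPO idea can be read as a simple sample-then-critique loop at inference time: produce several candidates, have an LLM write a comparative critique, and keep the candidate the critique favors. The Python sketch below is only an illustration under that reading; `generate_candidates`, `critique_candidates`, and `select_best` are hypothetical placeholders, not the authors' API.

```python
# Minimal sketch of a VideoTPO-style test-time loop (illustrative only).
# Every function here is a hypothetical placeholder standing in for a real
# image-to-video generator and an LLM critic; this is not the paper's code.

from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    video_path: str       # where the generated clip would be stored
    critique: str = ""    # LLM-written strengths/weaknesses analysis
    score: float = 0.0    # preference score derived from the critique


def generate_candidates(prompt: str, image_path: str, n: int = 4) -> List[Candidate]:
    """Placeholder: call an I2V model n times to get diverse candidates."""
    return [Candidate(video_path=f"candidate_{i}.mp4") for i in range(n)]


def critique_candidates(prompt: str, candidates: List[Candidate]) -> None:
    """Placeholder: ask an LLM to compare each candidate against the prompt,
    recording a strengths/weaknesses critique and a numeric preference score."""
    for i, cand in enumerate(candidates):
        cand.critique = f"Analysis of candidate {i} versus the prompt."
        cand.score = float(i)  # a real critic would score reasoning fidelity


def select_best(candidates: List[Candidate]) -> Candidate:
    """Keep the candidate the critic prefers; no training or reward model needed."""
    return max(candidates, key=lambda c: c.score)


if __name__ == "__main__":
    prompt = "The ball is pushed off the table; show what happens next."
    candidates = generate_candidates(prompt, image_path="table_scene.png")
    critique_candidates(prompt, candidates)
    best = select_best(candidates)
    print(f"Chosen video: {best.video_path}\nCritique: {best.critique}")
```

The point of the design, as the abstract describes it, is that everything happens at test time: there is no fine-tuning, no extra labeled data, and no separate reward model, just repeated sampling plus an LLM critique.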

Why it matters?

This work is important because it provides a better way to measure and improve the intelligence of video-generating AI. By focusing on reasoning abilities, we can move beyond just visually appealing videos and create AI that can truly understand and depict the world in a logical and consistent way, opening up possibilities for more useful and creative applications.

Abstract

The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.