
RISE-Video: Can Video Generators Decode Implicit World Rules?

Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang

2026-02-06

Summary

This paper introduces a new way to test whether AI video generators can produce videos that aren't just visually appealing but also make sense in the real world, following basic rules of physics and common sense.

What's the problem?

Current AI video generators are very good at making videos *look* realistic, but they often fail to understand and correctly portray how things actually work. For example, they might show objects floating in the air or behaving in ways that defy gravity. Until now, there was no good benchmark specifically designed to test this 'reasoning' ability in video generation.

What's the solution?

The researchers created a benchmark called RISE-Video, which includes 467 human-annotated video scenarios spanning eight categories, designed to test whether a model understands things like object interactions, spatial relationships, and even specialized domain knowledge. They also developed an automated way to evaluate the generated videos: large multimodal models score each video on four dimensions (Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality), mimicking how a human would judge whether the video makes sense, as the sketch below illustrates. They then tested 11 state-of-the-art video generation models on the benchmark.
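To make the "AI judging AI" idea concrete, here is a minimal sketch of what an LMM-as-judge scoring loop along these lines could look like. This is not the paper's actual pipeline: `query_lmm`, `score_video`, `benchmark`, and the prompt wording are illustrative assumptions; only the four metric names come from the abstract.

```python
# Minimal sketch of an LMM-as-judge evaluation loop for generated videos.
# `query_lmm` is a stand-in for whatever multimodal model API is available
# (a call that takes a text prompt plus sampled frames and returns text);
# it is an assumed interface, not the paper's implementation.

import json
from typing import Callable, Dict, List

METRICS = [
    "Reasoning Alignment",   # does the video follow the implicit rule in the scenario?
    "Temporal Consistency",  # do objects and identities stay coherent over time?
    "Physical Rationality",  # does motion obey basic physics (gravity, collisions)?
    "Visual Quality",        # is the frame-level rendering clean and artifact-free?
]

JUDGE_PROMPT = (
    "You are evaluating a generated video against this scenario:\n{scenario}\n\n"
    "Score each dimension from 1 (poor) to 5 (excellent) and return JSON, e.g.\n"
    '{{"Reasoning Alignment": 4, "Temporal Consistency": 5, '
    '"Physical Rationality": 3, "Visual Quality": 5}}'
)

def score_video(
    scenario: str,
    frames: List[bytes],                           # sampled frames from the generated clip
    query_lmm: Callable[[str, List[bytes]], str],  # hypothetical LMM call: (prompt, images) -> text
) -> Dict[str, float]:
    """Ask the judge model to rate one generated video on all four dimensions."""
    reply = query_lmm(JUDGE_PROMPT.format(scenario=scenario), frames)
    scores = json.loads(reply)                     # assumes the judge returns valid JSON
    return {m: float(scores[m]) for m in METRICS}

def benchmark(
    model_outputs: Dict[str, List[bytes]],         # scenario id -> frames from one generator
    scenarios: Dict[str, str],                     # scenario id -> scenario text
    query_lmm: Callable[[str, List[bytes]], str],
) -> Dict[str, float]:
    """Average each metric over all scenarios for one video generator."""
    totals = {m: 0.0 for m in METRICS}
    for sid, frames in model_outputs.items():
        per_video = score_video(scenarios[sid], frames, query_lmm)
        for m in METRICS:
            totals[m] += per_video[m]
    n = max(len(model_outputs), 1)
    return {m: totals[m] / n for m in METRICS}
```

Keeping the four scores separate rather than collapsing them into a single number mirrors the paper's multi-dimensional protocol: a model can render beautiful frames (high Visual Quality) while still breaking the implicit rule the scenario is testing (low Reasoning Alignment).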

Why it matters?

This work is important because it highlights a major weakness in current AI video technology. Just making videos look good isn't enough; they need to be logically consistent and reflect our understanding of how the world operates. By identifying these shortcomings, the researchers hope to push the development of AI that can create truly realistic and believable videos, which is crucial for applications like training simulations or creating virtual worlds.

Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.