RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu
2025-12-03
Summary
This paper introduces a new way to test how well video generation models can actually *think* and follow rules, not just create visually appealing videos.
What's the problem?
Current methods for evaluating video generation focus on how good the videos *look*: whether they are visually appealing, follow the given instructions, and move smoothly over time. However, they don't really check whether the models understand basic rules about how the world works and can apply that understanding when creating videos. We don't have a good way to break down and measure a video model's reasoning skills.
What's the solution?
The researchers created a benchmark called RULER-Bench. It gives video generation models tasks that require different kinds of reasoning, like understanding cause and effect or spatial relationships. The benchmark covers both text-to-video and image-to-video generation, with 622 annotated instances spanning 40 tasks across six rule categories. To score each video, they built a checklist covering four metrics and used a powerful AI, GPT-o3, to answer each checklist question; its scores agreed with human judgments 85% of the time. Testing current state-of-the-art models, they found the models still struggle with reasoning: the best one scored only 48.87% on the key rule coherence metric.
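To make the checklist-based protocol concrete, here is a minimal sketch of how an LLM-judge scoring loop of this kind can work. This is an illustration, not the paper's released code: `query_judge`, the metric names, and the example questions are all assumptions standing in for RULER-Bench's actual checklist and GPT-o3 prompt.

```python
# Hypothetical sketch of checklist-based scoring with an LLM judge.
# Nothing here is from the RULER-Bench release; names are illustrative.
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    metric: str    # e.g. "rule_coherence" (one of four metrics in the paper)
    question: str  # a yes/no question posed to the judge model


def query_judge(video_description: str, question: str) -> bool:
    """Placeholder for the GPT-o3 judge call.

    A real pipeline would send sampled video frames (or a caption)
    together with the checklist question and parse a yes/no answer.
    """
    raise NotImplementedError("wire up your LLM-judge API here")


def score_video(video_description: str,
                checklist: list[ChecklistItem]) -> dict[str, float]:
    """Return the fraction of checklist questions passed, per metric."""
    passed: dict[str, int] = {}
    totals: dict[str, int] = {}
    for item in checklist:
        totals[item.metric] = totals.get(item.metric, 0) + 1
        if query_judge(video_description, item.question):
            passed[item.metric] = passed.get(item.metric, 0) + 1
    return {m: passed.get(m, 0) / totals[m] for m in totals}


# Illustrative checklist for a cause-and-effect task:
checklist = [
    ChecklistItem("rule_coherence",
                  "Does the glass shatter *after* the ball hits it?"),
    ChecklistItem("rule_coherence",
                  "Do the shards fall downward under gravity?"),
    ChecklistItem("instruction_adherence",
                  "Is a red ball present, as the prompt requires?"),
]
```

Averaging yes/no judge answers per metric, as above, is one natural way to produce the percentage-style scores the paper reports (e.g. 48.87% rule coherence); the actual RULER-Bench aggregation may differ.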
Why it matters?
This work is important because it highlights a major weakness in current video generation models: they can create impressive visuals, but they don’t necessarily *understand* what they’re creating. By providing a way to specifically measure reasoning ability, this research can help developers build future video models that are more intelligent and capable of truly understanding and representing the world.
Abstract
Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, such as visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon the two fundamental paradigms of text-to-video and image-to-video generation, RULER-Bench covers 40 representative tasks spanning six rule categories, with 622 high-quality annotated instances. To evaluate each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign a score to each checklist question, achieving 85% alignment with human judgments. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insights obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.