The Trinity of Consistency as a Defining Principle for General World Models
Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang, Hexuan Jin, Caijun Jia, Honghao He, Xinglong Xu, Xi Bai, Chang Yu, Yumou Liu, Junnan Zhu, Xuanhe Zhou, Jintao Chen, Xiaobin Hu, Shancheng Pang, Bihui Yu, Ran He, Zhen Lei, Stan Z. Li, Conghui He
2026-02-27
Summary
This paper explores the challenge of building 'World Models' in artificial intelligence – essentially, AI systems that can understand and predict how the physical world works, much as humans do. It surveys recent progress in areas like video generation and multimodal models, but argues that the field lacks a clear, fundamental account of what makes a truly general World Model possible.
What's the problem?
Currently, AI models are getting better at *imitating* the physical world, for example by creating realistic videos. However, they don't necessarily *understand* the underlying rules governing that world. There is no solid theoretical foundation defining which properties an AI needs in order to model and reason about reality in a general way, and existing systems are often built as separate pieces that don't work together seamlessly.
What's the solution?
The researchers propose that a good World Model needs three key types of consistency working together: 'Modal Consistency', which means understanding different types of information (like images and text) in relation to each other; 'Spatial Consistency', which means understanding how things are arranged in space; and 'Temporal Consistency', which means understanding cause and effect over time. They also created a new benchmark called 'CoW-Bench' to test how well current AI models perform on tasks requiring reasoning about videos and generating future frames, providing a standardized way to evaluate progress.
Why it matters?
This work is important because it provides a roadmap for building more intelligent AI. By identifying these three core consistencies, it gives researchers a clear set of goals to aim for. The benchmark helps to objectively measure progress and to identify where current AI systems fall short, ultimately pushing the field closer to AI that can truly understand and interact with the world around it.
Abstract
The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the essential properties requisite for a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios. CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.