Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Muyang He, Hanzhong Guo, Junxiong Lin, Yizhou Yu
2026-03-31
Summary
This paper surveys methods for making video generation models, which have become remarkably good at producing realistic moving images, efficient enough to serve as practical simulators of real-world scenarios.
What's the problem?
While video generation is improving rapidly and shows potential to act as a 'world simulator', modeling how things move and change over time requires a huge amount of computing power. This high cost makes it difficult to use these models in practical applications that need quick responses, like self-driving cars or video games.
What's the solution?
The researchers reviewed the many ways people are trying to make video generation faster and less resource-intensive. They organized these methods into three main categories: how the models represent and generate dynamics over time (modeling paradigms), how the networks are built internally (architectures), and how quickly results can be produced at inference time (algorithms). In effect, they created a guide to the most promising efficiency techniques.
Why it matters?
Making video generation more efficient isn't just about saving money on compute. It's crucial for unlocking the potential of these models in real-time applications that interact with the world, like training robots, creating realistic game environments, or developing safer autonomous vehicles. Ultimately, efficiency is key to turning these models into truly useful 'digital worlds'.
Abstract
The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that treat efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy along three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.