Simulating the Visual World with Artificial Intelligence: A Roadmap
Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu
2025-11-17
Summary
This paper discusses how creating videos is changing from just making things *look* good to building complete, interactive virtual worlds. It looks at how new AI models are starting to act like they understand how the real world works, not just how to create images.
What's the problem?
Early video generation focused on short, visually appealing clips that lacked consistency and offered no way to interact. Simply making videos look realistic wasn't enough: the underlying models didn't capture physics or how things should behave in a believable way, which made them hard to use for tasks like training robots or building realistic games.
What's the solution?
The paper breaks down these advanced video models into two main parts: a 'world model' and a 'video renderer'. The world model is like a brain that understands the rules of the world – how objects interact, how gravity works, and even how characters might behave. The video renderer then takes what's happening in this 'brain' and turns it into the video you actually see. The paper traces the development of these models through four stages, showing how they have become more capable over time, culminating in models that can produce physically realistic videos and support interaction.
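To make the two-part design concrete, here is a minimal, hypothetical sketch of how a latent world model and a video renderer could be wired together in an interactive generation loop. The class and method names (WorldModel, VideoRenderer, step, render, rollout) and the placeholder dynamics are illustrative assumptions for this summary, not an interface defined in the paper.

```python
# A minimal, hypothetical sketch of the world-model / video-renderer split
# described above. Names and the placeholder dynamics are illustrative only.
from dataclasses import dataclass

import numpy as np


@dataclass
class LatentState:
    """Compact representation of the simulated world at one time step."""
    features: np.ndarray  # e.g. object positions, velocities, agent intent


class WorldModel:
    """The 'brain': advances the latent world state under actions and rules."""

    def step(self, state: LatentState, action: np.ndarray) -> LatentState:
        # Placeholder transition; a learned dynamics model would go here.
        return LatentState(features=state.features + 0.1 * action)


class VideoRenderer:
    """The 'window': decodes a latent state into an observable video frame."""

    def render(self, state: LatentState, height: int = 64, width: int = 64) -> np.ndarray:
        # Placeholder decoder; a learned generator would map latents to pixels.
        frame = np.full((height, width, 3), state.features.mean(), dtype=np.float32)
        return frame.clip(0.0, 1.0)


def rollout(world: WorldModel, renderer: VideoRenderer,
            init: LatentState, actions: list) -> list:
    """Interactive loop: simulate in latent space, render each step as a frame."""
    state, frames = init, []
    for action in actions:
        state = world.step(state, action)       # world model: simulate dynamics
        frames.append(renderer.render(state))   # renderer: produce the pixels
    return frames


if __name__ == "__main__":
    init = LatentState(features=np.zeros(8, dtype=np.float32))
    actions = [0.5 * np.ones(8, dtype=np.float32) for _ in range(4)]
    video = rollout(WorldModel(), VideoRenderer(), init, actions)
    print(len(video), video[0].shape)  # -> 4 (64, 64, 3)
```

The point of the sketch is the separation of concerns: the world model advances a compact latent state under actions and (implicit) physical rules, while the renderer only decodes that state into pixels, so interaction, planning, and consistency live in the latent simulation rather than in the video frames themselves.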
Why does it matter?
This is important because it means we're getting closer to AI that can create truly immersive and useful virtual environments. These advancements have big implications for fields like robotics (training robots in simulation), self-driving cars (simulating driving scenarios), and gaming (creating more realistic and interactive game worlds). Ultimately, it's about building AI that doesn't just *show* you a world, but lets you *interact* with one.
Abstract
The landscape of video generation is shifting from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models: models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.