PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma

2025-11-14

Summary

This paper introduces PAN, a 'world model': a system that predicts how the world will evolve in response to actions. Given video of the world so far and actions described in natural language, it simulates what happens next as video.

What's the problem?

Current video generation models can produce realistic clips, but they don't really 'understand' cause and effect. They turn a single prompt into a full video with no way to intervene mid-simulation, plan, or react to new actions. Existing world models, on the other hand, predict well in narrow domains such as games or physics simulations, but they struggle to generalize to the open, complex real world and to diverse kinds of interaction.

What's the solution?

The researchers built PAN around a Generative Latent Prediction (GLP) architecture that combines two components. First, a large language model (like the ones powering chatbots) acts as an autoregressive backbone: it reads actions described in words and predicts how the latent world state will change, drawing on its broad text-based knowledge. Second, a video diffusion decoder turns those predicted latent states into realistic, temporally coherent video. This lets PAN simulate future events conditioned on actions while staying consistent over long horizons, producing a believable, interactive world simulation.
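To make the two-stage design concrete, here is a minimal, self-contained sketch of that loop in PyTorch. Everything in it (the class names, the `embed_action` helper, the tensor sizes) is an illustrative assumption rather than PAN's actual code; the point is only the structure, where an LLM-style backbone steps the latent world state forward given a language action, and a diffusion-style decoder renders each latent as pixels.

```python
# Toy sketch of the Generative Latent Prediction (GLP) loop. All names and
# shapes are assumptions for illustration, not PAN's real API: the backbone
# stands in for the LLM latent-dynamics model, and the decoder stands in for
# the video diffusion model (real diffusion sampling is omitted).
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed size of the latent world state

class LatentDynamicsBackbone(nn.Module):
    """Stand-in for the LLM backbone: maps (current latent, action embedding)
    to the predicted next latent world state."""
    def __init__(self):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(2 * LATENT_DIM, 256), nn.GELU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, latent, action_emb):
        return self.step(torch.cat([latent, action_emb], dim=-1))

class VideoDiffusionDecoder(nn.Module):
    """Stand-in for the diffusion decoder: renders a latent into a frame.
    Here a single linear projection replaces iterative denoising."""
    def __init__(self, frame_pixels=3 * 32 * 32):
        super().__init__()
        self.render = nn.Linear(LATENT_DIM, frame_pixels)

    def forward(self, latent):
        return self.render(latent)

def embed_action(text: str) -> torch.Tensor:
    """Hypothetical action encoder; a real system would use the LLM's own
    tokenizer and embeddings. Here we hash the text into a pseudo-embedding."""
    gen = torch.Generator().manual_seed(hash(text) % (2 ** 31))
    return torch.randn(LATENT_DIM, generator=gen)

backbone, decoder = LatentDynamicsBackbone(), VideoDiffusionDecoder()
latent = torch.zeros(LATENT_DIM)  # initial world state
for action in ["open the door", "walk outside"]:
    latent = backbone(latent, embed_action(action))  # imagine the next state
    frame = decoder(latent)                          # realize it as pixels
    print(action, "->", tuple(frame.shape))
```

The key design choice this mirrors is the split between imagination and realization: the backbone reasons about what happens next entirely in latent space, and only the decoder pays the cost of producing high-resolution video.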

Why it matters?

This work is important because it's a step towards creating AI that can truly understand and interact with the world around it. A good world model is crucial for AI to plan, solve problems, and make decisions in a realistic way, moving beyond just generating content to actually 'thinking' about the consequences of its actions.

Abstract

A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
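The abstract notes that PAN is trained on large-scale video-action pairs. As a hedged sketch of what one training step might look like under the GLP setup, the function below pairs a latent-prediction term with a pixel-reconstruction term; the encoder, the two-term loss, and its weighting are my assumptions for illustration, not the paper's recipe.

```python
# Hypothetical training step over a (frames, actions) clip. The encoder maps
# frames to latents, the backbone predicts each next latent from the current
# latent and the action embedding, and the decoder reconstructs the next
# frame. Loss terms and their equal weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def training_step(encoder, backbone, decoder, frames, action_embs, optimizer):
    """frames: (T, C, H, W) video clip; action_embs: (T-1, D) embeddings of
    the language actions taken between consecutive frames."""
    latents = encoder(frames)                        # (T, D) per-frame latents
    pred_next = backbone(latents[:-1], action_embs)  # predict latents 1..T-1
    recon = decoder(pred_next)                       # render predicted frames

    # The latent term keeps "imagination" on track; the reconstruction term
    # ties predicted latents back to observable pixels ("reality").
    loss = (F.mse_loss(pred_next, latents[1:].detach())
            + F.mse_loss(recon, frames[1:]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```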