INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye

2026-04-09

Summary

This paper introduces a new system called INSPATIO-WORLD that creates realistic, interactive 4D worlds (3D space evolving over time) from just a single reference video. Imagine being able to step *into* a video and explore it in real time – that's what this research aims to make possible.

What's the problem?

Currently, creating these kinds of interactive 3D worlds is really hard. Existing methods often produce worlds that aren't visually realistic, or where things change inconsistently as you move around. For example, objects might appear and disappear, or the scene might not make physical sense. It’s difficult to build a world that feels truly real and responds predictably to your actions, especially when trying to do it quickly in real-time.

What's the solution?

The researchers developed INSPATIO-WORLD, which is built around a Spatiotemporal Autoregressive (STAR) architecture. It has a 'memory' called an Implicit Spatiotemporal Cache that aggregates the reference video and past observations into a latent world representation, keeping the scene consistent as you navigate over long stretches. An Explicit Spatial Constraint Module makes sure the 3D geometry stays correct and translates your inputs into precise, physically plausible camera movements. Finally, a training technique called Joint Distribution Matching Distillation (JDMD) uses real-world images and videos as a regularizing guide, so the generated worlds don't lose realism from relying too heavily on synthetic training data.
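To make the cache-plus-autoregression idea concrete, here is a deliberately tiny sketch of the generation loop. Everything in it is hypothetical – the class names, the fixed-capacity mean-pooled cache, and the linear "denoiser" stand in for the paper's actual (unpublished) network – but it shows the control flow: each step conditions on an aggregated world state and a user action, then writes the new frame latent back into the cache.

```python
import numpy as np

class ImplicitSpatiotemporalCache:
    """Toy rolling cache: aggregates reference and past frame latents
    into a fixed-size world representation. All names and the mean
    aggregation are illustrative assumptions, not the paper's design."""
    def __init__(self, dim, capacity=8):
        self.dim = dim
        self.capacity = capacity
        self.slots = []  # latents of the reference video and past frames

    def update(self, latent):
        self.slots.append(latent)
        if len(self.slots) > self.capacity:
            self.slots.pop(0)  # evict the oldest observation

    def world_state(self):
        # Simplest possible aggregation: mean over cached latents.
        return np.mean(self.slots, axis=0)

def generate_next_frame(cache, action, rng):
    """One spatiotemporal-autoregressive step: condition on the cached
    world state plus a user action, emit the next frame latent."""
    state = cache.world_state()
    # Placeholder dynamics; the real model is a learned video generator.
    next_latent = 0.9 * state + 0.1 * action + 0.01 * rng.standard_normal(cache.dim)
    cache.update(next_latent)  # the new frame becomes part of the memory
    return next_latent

rng = np.random.default_rng(0)
dim = 16
cache = ImplicitSpatiotemporalCache(dim)
cache.update(rng.standard_normal(dim))  # latent of the reference video
for _ in range(20):                     # long-horizon rollout
    frame = generate_next_frame(cache, np.zeros(dim), rng)
```

The point of the fixed capacity is that memory cost stays constant no matter how long you navigate, while the aggregated state still reflects both the reference video and recent history.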

Why it matters?

This work is important because it represents a significant step forward in creating truly immersive and interactive experiences. It could have applications in virtual reality, robotics, and even filmmaking, allowing people to explore and interact with digital environments in a much more natural and believable way. On the WorldScore-Dynamic benchmark, the system ranks first among real-time interactive methods, outperforming comparable approaches in spatial consistency and interaction precision.

Abstract

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
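One way to build intuition for JDMD's "real-world data as a regularizing guide" is a toy distribution-matching setup. The sketch below is an assumption-laden reading, not the paper's algorithm: scores are closed-form Gaussian scores, the student's own ("fake") score is flattened to zero, and `lam` is a made-up blending weight. What it does show is the mechanism – the student's samples are pushed along a blend of two score differences, one from the synthetic-data teacher and one estimated on real data, so the fixed point sits between the two distributions rather than collapsing onto the synthetic one.

```python
import numpy as np

def jdmd_gradient(x, synth_score, real_score, fake_score, lam=0.5):
    """Toy joint distribution-matching update direction.

    Classic DMD moves student samples along (teacher_score - fake_score).
    This joint variant (hypothetical reading of JDMD) blends guidance
    from a synthetic-data teacher with a real-data score term."""
    g_synth = synth_score(x) - fake_score(x)  # match the synthetic teacher
    g_real = real_score(x) - fake_score(x)    # regularize toward real data
    return (1 - lam) * g_synth + lam * g_real

# Isotropic-Gaussian scores: score of N(mu, I) at x is (mu - x).
mu_synth = np.array([2.0, 0.0])  # "synthetic data" mode
mu_real = np.array([0.0, 2.0])   # "real-world data" mode
synth_score = lambda x: mu_synth - x
real_score = lambda x: mu_real - x
fake_score = lambda x: np.zeros_like(x)  # toy: flat student distribution

x = np.zeros(2)
for _ in range(200):  # gradient-flow steps on a single sample
    x = x + 0.1 * jdmd_gradient(x, synth_score, real_score, fake_score)
# x converges to the blend (1-lam)*mu_synth + lam*mu_real = [1.0, 1.0]
```

With `lam=0` this reduces to matching the synthetic teacher alone, which is exactly the over-reliance-on-synthetic-data regime the abstract says causes fidelity degradation; `lam>0` pulls the fixed point toward the real-data distribution.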