MagicWorld: Interactive Geometry-driven Video World Exploration
Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, Peng-Tao Jiang
2025-11-26
Summary
This paper introduces a new system called MagicWorld that creates evolving video scenes based on what a user tells it to do. It's designed to make these generated videos more realistic and consistent over time.
What's the problem?
Current interactive video generation systems struggle with two main issues. First, they don't really account for the scene's 3D geometry, so when the viewing angle changes, the scene can fall apart and look structurally incorrect. Second, they tend to 'forget' what happened earlier in the video, so errors and inconsistencies pile up as the interaction goes on, such as objects changing size or position at random.
What's the solution?
MagicWorld tackles these problems in two key ways. Its Action-Guided 3D Geometry Module (AG3D) builds a simple 3D model of the scene, a point cloud, from the first frame of each interaction step and the user's action, which keeps the scene structurally consistent as the viewpoint changes. Its History Cache Retrieval (HCR) mechanism acts like a memory: it pulls up relevant past frames and feeds them back into the generator, helping the system remember what has already happened and avoid accumulating mistakes as the video progresses.
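To make the geometric idea more concrete, here is a minimal sketch of an AG3D-style step under simplifying assumptions that are not from the paper: a depth map of the first frame is lifted into a point cloud, a rigid camera motion derived from a user action is applied, and the points are reprojected into the new view to form a rough geometry guide. The depth map, camera intrinsics K, the action set, and all function names are illustrative; the actual module works inside a learned video model.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift a depth map into a 3D point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T            # per-pixel viewing rays
    return rays * depth.reshape(-1, 1)         # scale rays by depth -> (H*W, 3) points

def action_to_pose(action, step=0.25, angle=np.deg2rad(5)):
    """Map a discrete user action to a rigid transform of the scene points (hypothetical action set)."""
    R, t = np.eye(3), np.zeros(3)
    if action == "move_forward":
        t = np.array([0.0, 0.0, -step])        # moving forward brings scene points closer (smaller z)
    elif action == "turn_left":
        c, s = np.cos(angle), np.sin(angle)
        R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # yaw rotation applied to the points
    return R, t

def reproject(points, K, R, t, hw):
    """Project the transformed point cloud into the new view; returns a sparse depth guide."""
    h, w = hw
    cam = points @ R.T + t
    valid = cam[:, 2] > 1e-6                   # keep points in front of the camera
    proj = cam[valid] @ K.T
    uv = (proj[:, :2] / proj[:, 2:3]).round().astype(int)
    guide = np.full((h, w), np.nan)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    guide[uv[inside, 1], uv[inside, 0]] = cam[valid][inside, 2]
    return guide

# Toy usage: a flat scene 2 m away, viewed after a "turn_left" action.
K = np.array([[64.0, 0, 32], [0, 64.0, 32], [0, 0, 1]])
depth0 = np.full((64, 64), 2.0)
points = unproject_depth(depth0, K)
R, t = action_to_pose("turn_left")
geom_guide = reproject(points, K, R, t, (64, 64))  # would act as a geometric conditioning signal
```

In the real system, a guide like this would be one of several conditioning signals for the autoregressive video generator, not a rendering used on its own.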
Why it matters?
This work is important because it makes interactive video generation much more believable and useful. By maintaining structural integrity and remembering past events, MagicWorld creates scenes that feel more real and allows for longer, more complex interactions without the video falling apart.
Abstract
Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
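As a rough illustration of the history-retrieval idea described above, the sketch below caches past frames together with a feature vector and retrieves the most similar ones to the current generation step as conditioning. The HistoryCache class, the cosine-similarity scoring, and the top_k parameter are assumptions made for this example; the paper's HCR operates on learned video features inside the model.

```python
import numpy as np

class HistoryCache:
    """Minimal sketch of a history cache: store frame features, retrieve the past
    frames most similar to the current query feature (illustrative only)."""

    def __init__(self, top_k=4):
        self.top_k = top_k
        self.features = []   # list of (D,) unit-normalized frame feature vectors
        self.frames = []     # the cached frames (or latents) themselves

    def add(self, frame, feature):
        self.frames.append(frame)
        self.features.append(feature / (np.linalg.norm(feature) + 1e-8))

    def retrieve(self, query_feature):
        """Return the top-k cached frames most similar to the query feature."""
        if not self.features:
            return []
        q = query_feature / (np.linalg.norm(query_feature) + 1e-8)
        sims = np.stack(self.features) @ q           # cosine similarities to all cached frames
        idx = np.argsort(-sims)[: self.top_k]
        return [self.frames[i] for i in idx]         # would be injected as conditioning signals

# Toy usage with random vectors standing in for learned frame embeddings.
rng = np.random.default_rng(0)
cache = HistoryCache(top_k=2)
for step in range(6):
    cache.add(f"frame_{step}", rng.normal(size=128))
conditioning = cache.retrieve(rng.normal(size=128))
print(conditioning)   # e.g. ['frame_3', 'frame_1'] -> past frames fed back to the generator
```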