WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang

2025-12-23

Summary

This paper introduces a new method called WorldWarp for creating realistic and consistent videos, especially when the camera moves around a scene.

What's the problem?

Creating long videos with a moving camera is very hard for current AI models. They struggle to keep the scene geometrically consistent: objects can warp or vanish strangely, especially when parts of the scene are hidden from view or when the camera follows a complex path. Existing models excel at generating individual images, but they often lack a real understanding of the scene's 3D structure, which leads to these inconsistencies.

What's the solution?

WorldWarp tackles this by combining two main ideas. First, it builds a constantly updated 3D 'map' of the scene using a technique called Gaussian Splatting, which acts as a structural base: warping this map into a new camera view ensures each new frame respects the existing 3D shapes. Second, it uses a diffusion model, a type of AI that is good at generating detail, to fill in the gaps and imperfections that simple warping leaves behind. Crucially, the amount of noise the model works from varies by region: warped areas receive only partial noise, so the model merely refines their existing content, while previously hidden areas receive full noise, so the model generates them from scratch (a minimal sketch of this idea follows below). This process repeats chunk by chunk, updating the 3D map and refining the frames at each step.
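The heart of this 'fill-and-revise' step is a noise level that varies per pixel. Here is a minimal NumPy sketch of that idea, assuming a simple binary hole mask and an illustrative partial-noise level of 0.4; the paper's actual noise schedule, latent-space details, and denoiser are not reproduced here.

```python
import numpy as np

# Minimal sketch of a spatially varying noise schedule (all values are
# assumed for illustration; not the paper's actual schedule or resolution).
H, W = 64, 64
rng = np.random.default_rng(0)

warped_frame = rng.uniform(size=(H, W, 3))    # frame warped from the 3D cache
hole_mask = rng.uniform(size=(H, W)) < 0.2    # True where warping left no content

# Full noise (1.0) in holes triggers generation from scratch; partial noise
# (0.4 here, an assumed value) in warped regions preserves structure so the
# diffusion model only refines it.
sigma = np.where(hole_mask, 1.0, 0.4)[..., None]

noise = rng.standard_normal((H, W, 3))
noisy_input = np.sqrt(1.0 - sigma**2) * warped_frame + sigma * noise
# `noisy_input` would then be handed to the diffusion model for denoising.
```

The key design point is that a single denoising pass handles both jobs: the mask decides, per pixel, whether the model is generating or merely revising.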

Why it matters?

This research is important because it significantly improves the quality and realism of generated videos, particularly those with complex camera movements. By ensuring geometric consistency, WorldWarp creates videos where objects behave as they would in the real world, making the generated content far more believable and useful for applications such as virtual reality and filmmaking.

Abstract

Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: https://hyokong.github.io/worldwarp-page/
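To make the chunk-by-chunk process in the abstract concrete, here is a hedged end-to-end sketch of the loop: warp from the 3D cache, add spatially varying noise, denoise, and update the cache. The helpers (warp_from_cache, st_diff_denoise, update_cache) are hypothetical stand-ins for the 3DGS renderer and the ST-Diff model, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64

def warp_from_cache(cache, pose):
    """Hypothetical stand-in: render the cached geometry into a new pose."""
    frame = cache.get("last", np.zeros((H, W, 3)))
    hole_mask = rng.uniform(size=(H, W)) < 0.2   # placeholder occlusion holes
    return frame, hole_mask

def st_diff_denoise(noisy, sigma):
    """Hypothetical stand-in for the ST-Diff fill-and-revise denoiser."""
    return np.clip(noisy, 0.0, 1.0)

def update_cache(cache, frame, pose):
    cache["last"] = frame                        # the real method refits 3DGS

cache, video = {}, []
trajectory = range(8)                            # stand-in for camera poses
for pose in trajectory:
    warped, holes = warp_from_cache(cache, pose)
    sigma = np.where(holes, 1.0, 0.4)[..., None] # full vs. partial noise
    noisy = np.sqrt(1 - sigma**2) * warped + sigma * rng.standard_normal((H, W, 3))
    frame = st_diff_denoise(noisy, sigma)
    update_cache(cache, frame, pose)             # online 3D cache update
    video.append(frame)
```

The loop mirrors the abstract's division of labor: the cache supplies structure before the denoiser ever runs, and each refined frame is folded back into the cache so later chunks inherit the same geometry.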