Spatia: Video Generation with Updatable Spatial Memory
Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, Yan Lu
2025-12-26
Summary
This paper introduces Spatia, a new approach to video generation that keeps generated videos spatially consistent and realistic over their entire length.
What's the problem?
Making videos with computers is hard because videos carry a huge amount of information: every pixel changes over time, and everything needs to stay consistent in 3D space. Existing models often struggle to keep track of all this information over long videos, producing results where objects change shape or location unexpectedly, or where the scene simply doesn't look right from different angles.
What's the solution?
Spatia solves this by building a 'memory' of the 3D scene as a point cloud, like a digital sculpture. As each new video clip is generated, the model refers back to this 3D memory to keep everything consistent. It then uses visual SLAM, the same technique robots use to map their surroundings, to update the memory as the video progresses. Essentially, it separates what is static in the scene (stored in the memory) from what is moving (left to the video model), allowing both stable backgrounds and realistic moving objects. A minimal sketch of this loop is shown below.
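To make the generate-then-update loop concrete, here is a minimal Python sketch of the idea: render the current point-cloud memory into the next camera view, generate a clip conditioned on it, run visual SLAM on that clip, and fold the reconstructed static geometry back into the memory. All names (`PointCloudMemory`, `generate_clip`, `run_visual_slam`, ...) and the stub implementations are hypothetical placeholders for illustration, not the paper's actual code or API.

```python
# A minimal sketch of Spatia-style memory-conditioned generation, assuming
# hypothetical stubs for the video model and the SLAM module.

class PointCloudMemory:
    """Persistent spatial memory holding 3D points of the static scene."""
    def __init__(self):
        self.points = []                       # e.g. (x, y, z) tuples

    def render(self, camera_pose):
        """Project stored points into the next clip's camera view (stub)."""
        return {"pose": camera_pose, "points": list(self.points)}

    def update(self, static_points):
        """Merge newly reconstructed static points into the memory."""
        self.points.extend(static_points)


def generate_clip(condition):
    """Stub for the video generator: returns a short clip of 'frames'."""
    return [f"frame conditioned on {len(condition['points'])} memory points"
            for _ in range(4)]


def run_visual_slam(clip, last_pose):
    """Stub for visual SLAM: returns camera poses and static 3D points."""
    poses = [last_pose for _ in clip]          # pretend the camera holds still
    static_points = [(0.0, 0.0, float(i)) for i in range(len(clip))]
    return poses, static_points


def generate_video(num_clips, initial_pose=(0.0, 0.0, 0.0)):
    memory, pose, clips = PointCloudMemory(), initial_pose, []
    for _ in range(num_clips):
        condition = memory.render(pose)            # 1. condition on spatial memory
        clip = generate_clip(condition)            # 2. generate the next clip
        poses, static_pts = run_visual_slam(clip, pose)  # 3. SLAM on the new clip
        memory.update(static_pts)                  # 4. store static geometry only
        pose = poses[-1]
        clips.append(clip)
    return clips, memory


if __name__ == "__main__":
    clips, memory = generate_video(num_clips=3)
    print(len(clips), "clips;", len(memory.points), "points in spatial memory")
```

With these stubs the loop simply threads a growing point set through three iterations; in the real system the conditioning, generation, and SLAM steps are learned or geometric components, but the control flow follows the same generate-then-update pattern.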
Why it matters?
This matters because it gives creators more control over video generation. Imagine easily changing the camera angle in a generated video, or editing the underlying 3D scene directly. Spatia opens the door to these possibilities, making it easier to produce high-quality, realistic videos with far more flexibility than before.
Abstract
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
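The dynamic-static disentanglement mentioned in the abstract can be pictured as a filtering step before the memory update: only points judged to belong to the static scene are written into the persistent point cloud, while dynamic entities remain the generator's responsibility. The sketch below illustrates one simple way such a filter could look; the reprojection-error heuristic, the threshold, and all names are assumptions for illustration only, not the paper's actual criterion.

```python
# Illustrative static/dynamic split before the memory update, assuming a
# simple reprojection-error heuristic (not specified by the paper).

def split_static_dynamic(tracked_points, reproj_errors, threshold=2.0):
    """Separate reconstructed 3D points into static and dynamic sets.

    tracked_points: list of (x, y, z) points reconstructed from the clip
    reproj_errors:  per-point reprojection error (pixels) under a
                    static-scene camera model; large error suggests motion
    """
    static_pts, dynamic_pts = [], []
    for point, err in zip(tracked_points, reproj_errors):
        (static_pts if err <= threshold else dynamic_pts).append(point)
    return static_pts, dynamic_pts


# Only the static points would be merged into the persistent spatial memory,
# e.g.:
#   static_pts, _ = split_static_dynamic(points, errors)
#   memory.update(static_pts)
```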