MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg
2026-03-19
Summary
This paper introduces a new way to help video-generating AI models create more realistic and consistent videos, especially when the 'camera' moves around or revisits parts of the scene.
What's the problem?
Current AI models that generate videos struggle to maintain a consistent 'memory' of the scene. If you show the same scene from a slightly different angle, or revisit it later in the video, things can look wrong. Some methods try to build an explicit 3D model of the scene, but they have trouble with things that *move*. Other methods remember the scene implicitly, but they often get the camera motion wrong even when given the correct poses, which makes the video feel unstable.
What's the solution?
The researchers developed a system called Mosaic Memory, which combines the best of both worlds. It takes small pieces of the image and 'lifts' them into a 3D space to keep track of where things are, but it also uses the AI model's existing ability to understand what it's supposed to be creating. This allows it to accurately place things in the scene and make sure moving objects look natural. It essentially builds the scene by carefully patching together what should stay the same and then filling in what needs to change.
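To make the 'lift and compose' idea concrete, here is a minimal sketch of the general pattern: pixels are back-projected into 3D using depth and camera intrinsics, then reprojected into a queried view, producing a partial image plus a mask of holes left for the generative model to inpaint. This is an illustration of the underlying geometry only, not the paper's actual implementation; the function names, the per-pixel (rather than per-patch) granularity, and the use of dense depth are all assumptions for the example.

```python
import numpy as np

def lift_patches(depth, K):
    """Back-project every pixel into 3D camera coordinates.
    depth: (h, w) depth map; K: (3, 3) camera intrinsics.
    (Illustrative stand-in for lifting image patches into 3D.)"""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # un-project pixels to rays
    return rays * depth.reshape(-1, 1)       # scale rays by depth -> 3D points

def compose_in_view(points, colors, K, T, h, w):
    """Reproject stored 3D points into a queried camera view.
    T: (4, 4) world-to-camera transform of the queried view.
    Returns the composed image and a mask that is True where memory
    supplied content; the rest is left for the model to inpaint."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (T @ pts_h.T).T[:, :3]             # transform into the new camera frame
    valid = cam[:, 2] > 1e-6                 # keep only points in front of the camera
    proj = cam[valid] @ K.T
    uv = (proj[:, :2] / proj[:, 2:]).round().astype(int)
    img = np.zeros((h, w, 3))
    mask = np.zeros((h, w), dtype=bool)
    inb = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    img[uv[inb, 1], uv[inb, 0]] = colors[valid][inb]
    mask[uv[inb, 1], uv[inb, 0]] = True
    return img, mask
```

In this sketch the pixels where `mask` is False are exactly the regions the diffusion model would be asked to fill in, which is how the system can 'preserve what should persist while inpainting what should evolve'.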
Why does it matter?
This work is important because it brings us closer to AI models that can create truly immersive and interactive virtual worlds. Being able to navigate these worlds, edit scenes, and have the video continue realistically is a big step forward for things like virtual reality, filmmaking, and even just creating cool videos.
Abstract
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.