Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai

2026-03-30

Summary

This paper focuses on improving how AI video models remember what is in a scene, specifically when objects move out of view and later reappear.

What's the problem?

Current AI models that try to predict what will happen in a video often struggle when something temporarily disappears and then reappears. They might forget what the object looked like, distort its shape, or make it vanish entirely, because they treat the whole scene as a single static canvas instead of separating moving objects from the fixed background.

What's the solution?

The researchers developed a new approach called Hybrid Memory. This system makes the AI act like both an archivist for the static background and a tracker for moving objects: it remembers the background precisely while following the motion of subjects, even when they're hidden. They also built HM-World, a dataset of 59K video clips spanning 17 scenes and 49 subjects, with deliberately designed exit-and-reentry events to test this kind of memory. Finally, they propose HyDRA, a memory architecture that compresses past observations into tokens and retrieves only the tokens most relevant to the current moment, which lets it keep track of hidden objects.
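The paper does not give implementation details at this level, but the core idea of "compress memory into tokens, then selectively attend to the most relevant ones" can be sketched roughly as follows. Everything here is a simplified assumption for illustration: the function name `retrieve_memory`, the cosine-similarity scoring, and the top-k softmax readout are not from the paper.

```python
import numpy as np

def retrieve_memory(query, memory_tokens, top_k=4):
    """Hypothetical sketch of relevance-driven retrieval: score each
    compressed memory token against the current query and attend only
    to the top-k most relevant ones (e.g. motion cues of a hidden subject)."""
    # Cosine-similarity relevance scores between the query and each token.
    q = query / np.linalg.norm(query)
    m = memory_tokens / np.linalg.norm(memory_tokens, axis=1, keepdims=True)
    scores = m @ q
    # Keep only the top-k tokens; irrelevant memory is ignored entirely.
    idx = np.argsort(scores)[-top_k:]
    # Softmax weights over the selected tokens, then a weighted readout.
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ memory_tokens[idx]

# Toy usage: 32 memory tokens of dimension 8, one query vector.
rng = np.random.default_rng(0)
memory = rng.normal(size=(32, 8))
query = rng.normal(size=8)
readout = retrieve_memory(query, memory)
```

The point of the top-k step is that retrieval cost and noise stay bounded even as the memory grows, which is what lets a model carry long histories without attending to everything at every step.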

Why it matters?

This work matters because it makes video prediction models more realistic and reliable. If AI is going to be used in robotics or self-driving cars, it needs to accurately predict how the world changes even when things are temporarily hidden from view. Better memory means better predictions and safer, more effective systems.

Abstract

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.