MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, Hengshuang Zhao
2025-12-17
Summary
This paper focuses on improving how AI creates long, continuous videos, specifically addressing the challenge of keeping the video consistent and coherent over time.
What's the problem?
When AI generates videos piece by piece (streaming), it needs to 'remember' what happened earlier to ensure everything flows together logically. Existing methods try to compress and store past video frames, but they use the same storage strategy for everything, which isn't ideal because different parts of the video require focusing on different past events. It's hard for the AI to pick out the *right* memories to use when creating new sections of the video, especially if the story changes or the scene shifts.
What's the solution?
The researchers developed a system called MemFlow that dynamically updates its 'memory' before generating each new part of the video. Instead of storing everything equally, MemFlow retrieves and prioritizes the past video frames that are most relevant to the current text prompt describing what should happen next. It also focuses on only the most important parts of those past frames during the actual video creation process, making it efficient. This way, the AI can maintain a coherent narrative even when new things happen in the video.
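The retrieval step described above can be sketched as follows. This is a minimal illustration of prompt-conditioned memory selection, not the paper's actual implementation: the encoder producing the frame and prompt embeddings, the function name, and the cosine-similarity scoring are all assumptions made for the sketch.

```python
import numpy as np

def update_memory_bank(history_embs, prompt_emb, bank_size):
    """Hypothetical sketch: keep the historical frames most relevant
    to the text prompt of the next chunk.

    history_embs: (N, D) embeddings of past frames (assumed encoder output)
    prompt_emb:   (D,) embedding of the next chunk's text prompt
    bank_size:    number of frames retained in the memory bank
    """
    # Cosine similarity between each historical frame and the prompt.
    h = history_embs / np.linalg.norm(history_embs, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb)
    sims = h @ p
    # Keep the top-`bank_size` most relevant frames, restored to
    # temporal order so the memory still reads as a timeline.
    top = np.sort(np.argsort(sims)[::-1][:bank_size])
    return top, history_embs[top]
```

Because the bank is rebuilt before every chunk, a prompt that introduces a new event can pull in older frames that a fixed sliding-window memory would have discarded.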
Why it matters?
This work is important because it allows AI to generate much longer and more consistent videos without significantly slowing down the process. It's a relatively small performance hit (about 8% slower) to achieve a big improvement in video quality and coherence, and it can be added to existing video generation systems without major changes.
Abstract
The core challenge for streaming video generation is maintaining content consistency over long contexts, which places high demands on the memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different upcoming video chunks should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to the text prompt of that chunk. This design enables narrative coherence even when new events occur or the scenario switches in future frames. In addition, during generation, we activate only the most relevant tokens in the memory bank for each query in the attention layers, which effectively preserves generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computational burden (a 7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
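The abstract's per-query token activation can be illustrated with a small sketch. The function name, the top-k selection rule, and the single-head layout are assumptions for illustration; the idea shown is simply that each query attends only to its k highest-scoring memory tokens rather than the whole KV cache.

```python
import numpy as np

def sparse_memory_attention(q, mem_k, mem_v, top_k):
    """Hypothetical sketch: each query attends only to its top-k
    most relevant tokens in the cached memory bank.

    q:     (Tq, D) query tokens of the current chunk
    mem_k: (Tm, D) cached keys of the memory bank
    mem_v: (Tm, D) cached values of the memory bank
    """
    scores = q @ mem_k.T / np.sqrt(q.shape[1])          # (Tq, Tm)
    # For each query row, mask all but the top_k scores to -inf.
    low = np.argpartition(scores, -top_k, axis=1)[:, :-top_k]
    masked = scores.copy()
    np.put_along_axis(masked, low, -np.inf, axis=1)
    # Softmax over the surviving tokens; masked entries get zero weight.
    w = np.exp(masked - masked.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ mem_v
```

With `top_k` much smaller than the memory length, the value aggregation touches only a fraction of the cached tokens per query, which is one way a dynamic memory can stay cheap enough for the small overhead the paper reports.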