StoryMem: Multi-shot Long Video Storytelling with Memory
Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan
2025-12-23
Summary
This paper introduces a new way to create longer, more coherent videos from text prompts, aiming for cinematic quality in which scenes unfold like a story in a movie.
What's the problem?
Creating videos that tell a story over multiple scenes, or 'shots,' is currently very hard. Existing methods struggle to maintain consistency across shots: characters may look different from one shot to the next, settings can change unexpectedly, and the overall story can feel disjointed. It is difficult to make a video that plays as a single, flowing narrative rather than a series of unrelated clips.
What's the solution?
The researchers developed a system called StoryMem that works like a memory. It generates videos one shot at a time while keeping a 'memory bank' of keyframes from previously generated shots. When creating a new shot, the model refers back to this memory so that characters, settings, and style stay consistent. The memory is injected into an existing single-shot video diffusion model with only lightweight fine-tuning, so the whole system does not need to be retrained. The researchers also designed a way to automatically pick the most informative and visually appealing keyframes for the memory and to keep the story flowing smoothly between shots, as sketched in the code below.
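To make the iterative idea concrete, here is a minimal sketch of the memory-conditioned generation loop described above. All names (MemoryBank, generate_shot, select_keyframes) are hypothetical placeholders rather than the authors' API, and the toy arrays stand in for a real video diffusion model conditioned on stored keyframes.

```python
# Hypothetical sketch of iterative shot synthesis with an explicit memory bank.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryBank:
    """Compact, dynamically updated store of keyframes from earlier shots."""
    capacity: int = 16
    keyframes: list[np.ndarray] = field(default_factory=list)

    def update(self, new_keyframes: list[np.ndarray]) -> None:
        # Keep only the most recent keyframes up to capacity (a simple policy;
        # the paper uses semantic selection plus aesthetic filtering instead).
        self.keyframes = (self.keyframes + new_keyframes)[-self.capacity:]


def generate_shot(prompt: str, memory: MemoryBank) -> np.ndarray:
    # Stand-in for the memory-conditioned diffusion model: returns a fake
    # "video" tensor of shape (frames, height, width, channels).
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((16, 64, 64, 3))


def select_keyframes(shot: np.ndarray, k: int = 2) -> list[np.ndarray]:
    # Stand-in for semantic keyframe selection: just take evenly spaced frames.
    idx = np.linspace(0, len(shot) - 1, num=k, dtype=int)
    return [shot[i] for i in idx]


def generate_story(shot_prompts: list[str]) -> list[np.ndarray]:
    memory = MemoryBank()
    shots = []
    for prompt in shot_prompts:
        shot = generate_shot(prompt, memory)   # condition on stored memory
        memory.update(select_keyframes(shot))  # refresh memory with new keyframes
        shots.append(shot)
    return shots


if __name__ == "__main__":
    story = generate_story(["A knight rides at dawn.", "The knight enters a castle."])
    print(len(story), story[0].shape)
```

The key design choice is that the model never sees the full history of generated frames, only a small, curated set of keyframes, which keeps the conditioning cost roughly constant as the story grows.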
Why it matters?
This work is a big step forward in automated video creation. It allows for the generation of longer, more visually appealing, and logically consistent videos, opening up possibilities for things like automatically creating short films or personalized stories. The new benchmark they created will also help other researchers improve their video generation techniques.
Abstract
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
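The abstract's Memory-to-Video (M2V) injection, combining latent concatenation with negative RoPE shifts, can be illustrated with the hedged sketch below. Memory keyframe latents are appended to the noisy video latents, and their rotary-position (RoPE) indices are shifted to negative values so attention treats them as context preceding the shot being denoised. The exact shift value, latent shapes, and the build_rope_positions helper are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of memory injection via latent concatenation and
# negatively shifted RoPE positions (illustrative, not the authors' code).
import torch


def build_rope_positions(num_memory: int, num_video: int, shift: int = 8) -> torch.Tensor:
    # Video frames occupy temporal positions 0 .. num_video - 1.
    video_pos = torch.arange(num_video)
    # Memory keyframes get negative positions, offset by `shift`, so they are
    # placed strictly before the current shot in the positional space.
    memory_pos = -shift - torch.arange(num_memory - 1, -1, -1)
    return torch.cat([memory_pos, video_pos])


def concat_memory_latents(memory_latents: torch.Tensor,
                          video_latents: torch.Tensor) -> torch.Tensor:
    # Latent concatenation along the temporal axis: (B, T, C, H, W).
    return torch.cat([memory_latents, video_latents], dim=1)


if __name__ == "__main__":
    B, C, H, W = 1, 4, 8, 8
    memory = torch.randn(B, 3, C, H, W)   # 3 stored keyframe latents
    video = torch.randn(B, 16, C, H, W)   # 16 noisy frame latents to denoise
    tokens = concat_memory_latents(memory, video)
    positions = build_rope_positions(num_memory=3, num_video=16)
    print(tokens.shape, positions.tolist())
```

Because the conditioning enters only through concatenated latents and shifted position indices, the base single-shot diffusion model can, as the abstract notes, be adapted with lightweight LoRA fine-tuning instead of full retraining.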