OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie

2025-12-10

Summary

This paper introduces a new method called OneStory for creating longer, more coherent videos from a series of shorter clips, aiming to mimic how stories are told in real life.

What's the problem?

Currently, generating videos as a sequence of multiple shots is difficult because existing methods struggle to 'remember' what happened in earlier shots. They either attend only to a narrow window of recent clips or condition on a single keyframe, which isn't enough to maintain a consistent storyline when the narrative gets complex. The result is videos that feel disjointed or lose track of characters and settings.

What's the solution?

OneStory tackles this by reframing multi-shot generation as predicting the *next* shot, so each new shot is generated autoregressively from everything that came before. It uses two main modules: first, a Frame Selection module picks out the most informative frames from previous shots to build a compact 'memory' of what has already happened. Second, an Adaptive Conditioner compresses that memory, keeping only its most important parts, into a compact context that directly guides the generation of the next shot, building on a pretrained image-to-video model. The authors also curated a dataset of about 60,000 multi-shot videos with referential captions to train and evaluate the system.
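The two steps above can be sketched at a toy level. This is *not* the OneStory implementation (which operates on video latents inside a diffusion model); it is a minimal illustration, assuming frames are represented as fixed-size embedding vectors, of how one might (1) select informative prior frames by relevance to the next-shot prompt and (2) reduce that memory to a compact conditioning context. All function names and the similarity-based scoring are hypothetical.

```python
import numpy as np

def select_frames(shots, query, k=4):
    """Toy 'Frame Selection': score every frame embedding from prior shots
    by cosine similarity to the next-shot prompt embedding, and keep the
    top-k as a global memory of the story so far."""
    frames = np.concatenate(shots, axis=0)  # (N, D) frame embeddings
    sims = frames @ query / (
        np.linalg.norm(frames, axis=1) * np.linalg.norm(query) + 1e-8
    )
    top = np.argsort(-sims)[:k]             # indices of the k most relevant frames
    return frames[top], sims[top]

def adaptive_condition(memory, scores, budget=2):
    """Toy 'Adaptive Conditioner': importance-guided compression — keep only
    the `budget` highest-scoring memory entries as the compact context that
    conditions generation of the next shot."""
    keep = np.argsort(-scores)[:budget]
    return memory[keep]

# Hypothetical usage: 3 prior shots of 5 frames each, 8-dim embeddings.
rng = np.random.default_rng(0)
shots = [rng.normal(size=(5, 8)) for _ in range(3)]
prompt = rng.normal(size=8)

memory, scores = select_frames(shots, prompt, k=4)   # global memory: 4 frames
context = adaptive_condition(memory, scores, budget=2)  # compact context: 2 entries
```

The key design idea this mirrors is that the memory is *global* (drawn from all prior shots, not just the latest one) yet *compact* (reduced to a small budget before conditioning), which is what keeps long narratives both consistent and affordable to generate.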

Why it matters?

This research is important because it moves us closer to being able to automatically generate long-form, story-driven videos that are both visually appealing and logically consistent. This has potential applications in areas like automated content creation, personalized video experiences, and even helping people create videos more easily.

Abstract

Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.