ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue
2026-03-30
Summary
This paper introduces ShotStream, a new system for creating videos made up of multiple shots, like a short movie or story. It focuses on making this process faster and allowing for more control during creation.
What's the problem?
Currently, creating long videos with multiple scenes is difficult because existing methods are slow and don't allow for much interaction. Existing bidirectional systems process the whole video at once, which demands a lot of computing power and time. When videos are instead generated step by step, small errors can accumulate and degrade the final result, and keeping everything visually consistent between shots is a challenge.
What's the solution?
ShotStream tackles this by generating videos shot by shot, like writing a story one paragraph at a time. It starts with a powerful text-to-video generator and trains it to predict the *next* shot based on what has already been created. To keep things consistent, it uses two kinds of memory: a global cache that remembers frames from the story so far, and a local cache that remembers the frames generated within the current shot. A special positional indicator tells the system which memory each frame belongs to, so the two are never confused. The training method also introduces complexity gradually: the model first learns to correct itself within a single shot, conditioned on real previous shots, and then extends to longer sequences built from its own generated history, closing the gap between how it is trained and how it is used.
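The dual-cache memory described above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the class name, the cache capacities, and the rule for promoting a few representative frames to the global cache at a shot boundary are all assumptions made for clarity.

```python
from collections import deque

class DualCacheMemory:
    """Toy sketch of ShotStream's dual-cache idea (names and sizes are
    hypothetical): a global cache of frames from past shots for
    inter-shot consistency, and a local cache of frames within the
    current shot for intra-shot consistency."""

    def __init__(self, global_capacity=8, local_capacity=16):
        # Global context cache: bounded history of conditional frames
        # from earlier shots (characters, scenery stay consistent).
        self.global_cache = deque(maxlen=global_capacity)
        # Local context cache: frames generated so far in this shot.
        self.local_cache = deque(maxlen=local_capacity)

    def add_frame(self, frame):
        """Append a newly generated frame to the current shot's cache."""
        self.local_cache.append(frame)

    def end_shot(self, keyframes_per_shot=2):
        """At a shot boundary, promote a few evenly spaced frames to the
        global cache, then clear the local cache for the next shot."""
        frames = list(self.local_cache)
        step = max(1, len(frames) // keyframes_per_shot)
        for f in frames[::step][:keyframes_per_shot]:
            self.global_cache.append(f)
        self.local_cache.clear()

    def context(self):
        """Conditioning context for the next prediction: global history
        first, then the in-progress shot."""
        return list(self.global_cache) + list(self.local_cache)
```

In use, each generated frame is added to the local cache, and ending a shot folds a summary of it into the global cache, so the context for the next shot stays small while still carrying the whole story.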
Why it matters?
This work is important because it makes creating multi-shot videos much faster, reaching up to 16 frames per second on a single GPU, and allows users to influence the story as it unfolds. This opens the door to real-time interactive storytelling and more efficient video production, potentially letting anyone easily create longer, more complex videos.
Abstract
Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator is further employed to explicitly distinguish the two caches, eliminating ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our project page.
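The RoPE discontinuity indicator from the abstract can be illustrated with a toy position-assignment function. The function name, the gap value, and the offset scheme below are hypothetical; the sketch only conveys the core idea, which is that positions jump discontinuously at the boundary between the global and local caches, so attention layers can tell the two apart.

```python
def position_ids(num_global, num_local, gap=1000):
    """Hypothetical sketch: assign RoPE position indices so that
    global-cache tokens occupy one contiguous range and local-cache
    tokens start only after a large gap. The deliberate jump is the
    'discontinuity' that disambiguates the cache boundary."""
    # Global context cache tokens: contiguous positions 0..num_global-1.
    global_ids = list(range(num_global))
    # Local context cache tokens: shifted by a large offset so they are
    # not position-adjacent to the global history.
    local_ids = [num_global + gap + i for i in range(num_local)]
    return global_ids + local_ids
```

For example, with three global-cache tokens and two local-cache tokens, the global tokens sit at positions 0, 1, 2 while the local tokens start far away at 1003, 1004; without the gap, the model could not tell whether a cached frame came from a past shot or the current one.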