HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu
2025-10-24
Summary
This paper introduces HoloCine, a new AI model designed to create longer, more coherent videos from text prompts, moving beyond isolated short clips toward 'filming' complete multi-shot scenes.
What's the problem?
Current text-to-video AI models are very good at making individual short video clips, but they struggle to string those clips together into a meaningful story or scene with consistent characters and settings. Because they cannot maintain a narrative across multiple shots, the result is a disjointed viewing experience.
What's the solution?
The researchers built HoloCine, which generates the entire scene at once instead of one clip at a time. A mechanism called 'Window Cross-Attention' ties each part of the text prompt to the specific shot it describes, giving the model directorial control over individual shots. To keep generation efficient, the model applies full (dense) attention *within* each shot but only sparse attention *between* shots, which makes minute-long, multi-shot videos feasible without massive computing power.
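For a concrete picture, here is a minimal sketch (not the authors' code) of the two attention patterns described above, written in PyTorch. It assumes the video latent is laid out as consecutive per-shot token blocks and that each shot has its own sub-prompt; the helper names are hypothetical, and the fully blocked inter-shot mask is a simplification (the actual model keeps a reduced, sparse pathway between shots so characters and scenes stay consistent).

# Illustrative attention-mask sketch (assumed layout: video tokens grouped
# by shot; each shot has its own text sub-prompt).
import torch

def sparse_inter_shot_self_attention_mask(shot_lengths):
    """Dense attention inside each shot; attention between shots is blocked
    here for simplicity (the real model retains sparse inter-shot attention)."""
    total = sum(shot_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)  # True = may attend
    start = 0
    for n in shot_lengths:
        mask[start:start + n, start:start + n] = True   # dense intra-shot block
        start += n
    return mask

def window_cross_attention_mask(shot_lengths, prompt_lengths):
    """Each shot's video tokens attend only to the text tokens of the
    sub-prompt describing that shot, localizing the prompt per shot."""
    mask = torch.zeros(sum(shot_lengths), sum(prompt_lengths), dtype=torch.bool)
    v, t = 0, 0
    for nv, nt in zip(shot_lengths, prompt_lengths):
        mask[v:v + nv, t:t + nt] = True
        v, t = v + nv, t + nt
    return mask

# Example: 3 shots with 4, 6, and 5 video tokens and 3, 2, 4 prompt tokens.
self_mask = sparse_inter_shot_self_attention_mask([4, 6, 5])
cross_mask = window_cross_attention_mask([4, 6, 5], [3, 2, 4])
# Boolean masks like these can be passed as `attn_mask` to
# torch.nn.functional.scaled_dot_product_attention (True = attend).

Because each video token only attends to tokens in its own shot (self-attention) and to its own shot's sub-prompt (cross-attention), the cost of attention grows roughly with shot length rather than with the full scene length, which is what makes minute-scale generation tractable in this sketch's framing.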
Why it matters?
This work is a big step towards fully automated filmmaking. Instead of just generating isolated video snippets, HoloCine can create videos with a sense of continuity, remembering characters and even understanding basic cinematic techniques. This could eventually lead to AI tools that can create entire short films from just a written script, opening up new possibilities for content creation.
Abstract
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.