
VideoAuteur: Towards Long Narrative Video Generation

Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang

2025-01-14


Summary

This paper introduces a new way for AI to create long videos that tell a story, like a cooking show. The researchers built a tool called VideoAuteur that can plan and generate videos that make sense from start to finish.

What's the problem?

Right now, AI can make short video clips that look good, but it struggles to produce longer videos that tell a clear story. Once a video runs beyond a few seconds, it's hard for the AI to keep track of what's happening and keep events consistent from beginning to end.

What's the solution?

The researchers did a few things to solve this problem. First, they made a huge collection of cooking videos with detailed descriptions of what's happening. Then, they created a smart AI system called the Long Narrative Video Director. This system acts like a movie director, planning out the whole video to make sure the story makes sense. It creates a series of key images that represent the main parts of the story. Finally, they made the AI better at matching the words describing the video with the actual images, so the final video looks good and tells the right story.
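The two-stage idea described above — a director that plans the story as a sequence of keyframes, followed by a video model conditioned on both the text and the visual embeddings — can be sketched in simplified form. All names, signatures, and the placeholder embeddings below are illustrative assumptions, not the authors' actual API:

```python
# Hypothetical sketch of a director-then-generator pipeline like the one
# VideoAuteur describes. Real systems use learned models; here both stages
# are stubbed out so the control flow is easy to follow.

from dataclasses import dataclass

@dataclass
class Keyframe:
    caption: str            # text describing this beat of the narrative
    embedding: list[float]  # stand-in for a learned visual embedding

def narrative_director(recipe_steps: list[str]) -> list[Keyframe]:
    """Stage 1: plan the long video as one caption + visual embedding
    per story beat, so the whole narrative stays coherent.

    (In the paper this role is played by the Long Narrative Video
    Director; the trivial placeholder embedding is our assumption.)
    """
    keyframes = []
    for i, step in enumerate(recipe_steps):
        caption = f"Step {i + 1}: {step}"
        embedding = [float(i)] * 4  # placeholder, not a real latent
        keyframes.append(Keyframe(caption, embedding))
    return keyframes

def generate_clip(keyframe: Keyframe) -> dict:
    """Stage 2: stand-in for a video model conditioned on both the text
    caption and the visual embedding of a keyframe."""
    return {"caption": keyframe.caption, "frames": 48}  # dummy clip

def make_long_video(recipe_steps: list[str]) -> list[dict]:
    # Plan first, then render one clip per planned keyframe in order.
    return [generate_clip(kf) for kf in narrative_director(recipe_steps)]

clips = make_long_video(["Chop the onions", "Saute until golden", "Add the broth"])
print(len(clips), "clips, starting with:", clips[0]["caption"])
```

The point of the split is that narrative consistency is decided up front by the director, so each generated clip only has to match its local keyframe rather than reason about the whole story.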

Why it matters?

This matters because it could change how we make and use videos in the future. Imagine being able to type in a recipe and have an AI create a whole cooking show for you. Or think about making educational videos or even movies without needing a real camera or actors. This technology could make it easier and cheaper to create all kinds of videos, which could be used for teaching, entertainment, or even helping people learn new skills just by watching AI-generated videos.

Abstract

Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/