Co-Director: Agentic Generative Video Storytelling
Yale Song, Yiwen Song, Nick Losier, Nathan Hodson, Ye Jin, Rhyard Zhu, Yan Xu, Daniel Vlasic, Carina Claassen, Jasmine Leon, Khanh G. LeViet, Zack Chomyn, Joe Timmons, Brett Slatkin, Scott Penberthy, Tomas Pfister
2026-04-29
Summary
This paper introduces Co-Director, a system that aims to create more coherent, story-like videos using artificial intelligence, specifically diffusion models, which excel at generating realistic images and video clips.
What's the problem?
While AI can now generate impressive individual video clips, stringing them together into a meaningful story remains hard. Existing methods chain together separate AI modules, but because each module works independently, small errors accumulate, producing videos that stop making sense or lose track of what they are trying to show.
What's the solution?
Co-Director tackles this by treating video storytelling as a global optimization problem. A 'global' system, based on a multi-armed bandit, explores different creative directions for the story, while a 'local' self-refinement loop keeps each part of the video consistent so the subject doesn't change unexpectedly. This combination lets the system try new ideas while still maintaining a clear, understandable narrative.
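The paper does not release code, but the global exploration step can be pictured with a standard multi-armed bandit. The sketch below uses UCB1 to balance exploring and exploiting among a few creative directions; the direction names and the Gaussian "critic scores" are invented stand-ins for whatever quality signal the real system uses.

```python
import math
import random

def ucb1_select(counts, values, t):
    """Pick the arm with the highest UCB1 score; try each arm once first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

def run_bandit(reward_fns, rounds=500, seed=0):
    """Run UCB1 for `rounds` steps; return per-arm pull counts and mean rewards."""
    random.seed(seed)
    k = len(reward_fns)
    counts = [0] * k
    values = [0.0] * k  # running mean reward per arm
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, values, t)
        reward = reward_fns[arm]()
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

# Three hypothetical "creative directions"; the scores simulate a critic
# rating the coherence of a generated storyboard (not the paper's actual metric).
directions = [
    lambda: random.gauss(0.4, 0.1),  # e.g. "product close-up montage"
    lambda: random.gauss(0.7, 0.1),  # e.g. "character-driven mini-story"
    lambda: random.gauss(0.5, 0.1),  # e.g. "abstract mood piece"
]
counts, values = run_bandit(directions)
```

Over enough rounds, the bandit concentrates its pulls on the direction with the highest average score while still occasionally sampling the others; Co-Director's local self-refinement loop would then sit inside each pull, polishing the chosen direction for consistency.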
Why does it matter?
This research matters because it provides a more reliable way to automatically generate videos that actually tell a story. The researchers also created a new dataset specifically for testing these kinds of systems, which should help others improve AI video creation. The approach shows promise for personalized advertising and even longer, more complex cinematic narratives.
Abstract
While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD-Bench, a 400-scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co-Director significantly outperforms state-of-the-art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives. Project Page: https://co-director-agent.github.io/