Bridging Your Imagination with Audio-Video Generation via a Unified Director
Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo, Guosheng Lin, Xin Chen
2025-12-30
Summary
This paper introduces UniMAGE, a new AI system that creates videos from text prompts more effectively by combining script drafting (writing the story) and key-shot design (planning the visuals) into a single step, mimicking how a human film director works.
What's the problem?
Current AI video creation tools handle script writing and visual planning separately, using a different AI model for each. This separation limits creative potential because the visual ideas aren't directly informed by the story, and vice versa. It's like a writer and a storyboard artist who don't communicate: the final product ends up less cohesive and imaginative than it could be.
What's the solution?
The researchers developed UniMAGE, which uses an AI architecture called Mixture-of-Transformers to handle both text and image generation within a single model. They also created a two-stage training method: first, Interleaved Concept Learning teaches the model to connect text and images deeply, understanding how words translate into visuals; second, Disentangled Expert Learning separates the tasks of writing the script and generating the keyframes, allowing for more flexibility and creative control over the storytelling process.
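To make the Mixture-of-Transformers idea concrete: the general pattern is a transformer block in which all tokens (text and image) share one global self-attention, while each modality is routed through its own feed-forward expert. The sketch below is a minimal PyTorch illustration of that pattern, not the paper's actual implementation; the layer sizes, the two-expert split, and the names `MoTBlock` and `modality` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Illustrative Mixture-of-Transformers-style block: shared self-attention
    over the interleaved sequence, with a separate feed-forward expert per
    modality (0 = text tokens, 1 = image tokens). Sizes are arbitrary."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One FFN expert per modality; tokens are routed by their modality id.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, modality):
        # x: (batch, seq, dim); modality: (batch, seq) with values in {0, 1}.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # global attention mixes both modalities
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality == m          # select this modality's tokens
            out[mask] = expert(h[mask])   # modality-specific FFN
        return x + out
```

The key property this sketch captures is that narrative (text) and visual (image) tokens attend to each other at every layer, while each keeps modality-specialized parameters.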
Why it matters?
This work is important because it moves AI video creation closer to producing high-quality, long-form videos with a clear narrative. By unifying the script and visual planning stages, UniMAGE can generate videos that are not only visually appealing but also make logical sense, potentially allowing anyone to create compelling films without needing specialized skills.
Abstract
Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.