GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

2024-12-09

Summary

This paper introduces GenMAC, a system that coordinates multiple AI agents to generate videos from text descriptions, enabling more complex and dynamic scenes than a single model can handle.

What's the problem?

While text-to-video generation has improved, existing models struggle with creating detailed scenes based on complex text prompts. They have difficulty handling multiple objects, their interactions, and changes over time, which makes it hard to produce realistic and engaging videos.

What's the solution?

The authors developed GenMAC, a multi-agent framework that breaks the video generation process into three stages: Design, Generation, and Redesign. In the Design stage, the system plans the layout of objects based on the text. During Generation, it creates the video using this layout. The Redesign stage checks the generated video against the original text prompt, suggests corrections, and updates the prompts, layouts, and guidance scales for the next round; Generation and Redesign then repeat in a loop until the video matches the prompt. To reduce errors from relying on a single model, the Redesign stage is split across four specialized agents that verify the video, suggest improvements, correct errors, and structure the final output. GenMAC also includes a self-routing mechanism that selects the best correction agent for each specific scenario.
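The control flow described above can be sketched in pseudocode-style Python. Everything here is illustrative: the agent functions are stubs with placeholder logic standing in for MLLM calls, and all names (`GenState`, `route_correction`, etc.) are invented for this sketch, not taken from the paper's code.

```python
# Minimal sketch of GenMAC's Design -> Generation -> (Redesign <-> Generation) loop.
# Agent behavior is stubbed; in the real system each agent is an MLLM call.
from dataclasses import dataclass, field


@dataclass
class GenState:
    prompt: str
    layout: str = ""            # frame-wise layout, simplified to a string
    guidance_scale: float = 7.5
    video: str = ""             # stands in for the generated frames
    log: list = field(default_factory=list)


def design(state):
    # Design stage: plan object layouts from the text prompt.
    state.layout = f"layout({state.prompt})"
    return state


def generate(state):
    # Generation stage: render a video from the current layout and prompt.
    state.video = f"video({state.layout}, gs={state.guidance_scale})"
    return state


# --- Redesign stage, decomposed into four sequential agents ---

def verification_agent(state):
    # Stub heuristic: keep flagging an attribute-binding issue until
    # the layout has been corrected once.
    return [] if "+bind_attrs" in state.layout else ["attribute_binding"]


def suggestion_agent(issues):
    # Turn each detected issue into a concrete correction suggestion.
    return [f"fix:{issue}" for issue in issues]


# One specialized correction agent per scenario (attribute binding,
# temporal dynamics, ...); the self-router picks among them.
CORRECTION_AGENTS = {
    "attribute_binding": lambda s: s.layout + "+bind_attrs",
    "temporal_dynamics": lambda s: s.layout + "+retime",
}


def route_correction(issue):
    # Self-routing: select the correction agent specialized for this issue.
    return CORRECTION_AGENTS.get(issue, lambda s: s.layout)


def output_structuring_agent(state, new_layout, tip):
    # Package the redesigned layout and guidance scale for the next iteration.
    state.layout = new_layout
    state.guidance_scale += 0.5
    state.log.append(tip)
    return state


def genmac(prompt, max_iters=3):
    state = design(GenState(prompt))
    for _ in range(max_iters):
        state = generate(state)
        issues = verification_agent(state)
        if not issues:
            break  # video matches the prompt; stop iterating
        for issue, tip in zip(issues, suggestion_agent(issues)):
            corrected = route_correction(issue)(state)
            state = output_structuring_agent(state, corrected, tip)
    return state
```

Running `genmac("a red ball bouncing over a blue cube")` takes one redesign round in this stub: the first verification flags an issue, the routed correction agent patches the layout, and the second generation passes verification.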

Why it matters?

This research is important because it significantly improves how AI can create videos from text. By using a collaborative approach with multiple specialized agents, GenMAC can produce more accurate and dynamic videos that better match user expectations. This advancement could lead to better applications in entertainment, education, and content creation where high-quality video generation is essential.

Abstract

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging stage that aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of a single MLLM agent, we decompose this stage into four sequentially-executed MLLM-based agents: verification agent, suggestion agent, correction agent, and output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.