AutoMV: An Automatic Multi-Agent System for Music Video Generation

Xiaoxuan Tang, Xinping Lei, Chaoran Zhu, Shiyun Chen, Ruibin Yuan, Yizhi Li, Changjae Oh, Ge Zhang, Wenhao Huang, Emmanouil Benetos, Yang Liu, Jiaheng Liu, Yinghao Ma

2025-12-16

Summary

This paper introduces a new system called AutoMV that automatically creates full-length music videos from just the song itself, aiming to overcome the limitations of existing methods that produce short and choppy videos.

What's the problem?

Automatically generating a music video for a whole song is currently very hard. Existing systems can only produce short clips that don't follow the music's structure, beat, or lyrics, and the resulting videos rarely make sense from beginning to end because they lack visual consistency. It is difficult for computers to understand the music and translate that understanding into a coherent, visually appealing video.

What's the solution?

The researchers developed AutoMV, which works like a team of specialized agents. First, it analyzes the song to extract its structure, vocals, and time-aligned lyrics. A 'screenwriter' agent then drafts a basic storyline, and a 'director' agent defines the characters and camera instructions. These agents call external image and video generators to produce keyframes and clips. Finally, a 'verifier' agent checks the output for coherence and quality, so the agents can iterate together toward a complete music video. The researchers also created a detailed set of criteria for judging how good these automatically generated videos are.
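The agent pipeline described above can be sketched as a simple orchestration loop. This is an illustrative mock-up only: all class and function names here (`SongFeatures`, `screenwriter_agent`, `verifier_agent`, etc.) are hypothetical stand-ins, not the paper's actual implementation, and the real system would call music-analysis tools and generative models where these stubs return placeholder values.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AutoMV-style agent pipeline.
# All names and return values are illustrative stand-ins.

@dataclass
class SongFeatures:
    structure: list  # e.g. ["intro", "verse", "chorus"]
    lyrics: list     # time-aligned lyric lines as (seconds, text)

def analyze_song(song_path):
    # Stand-in for music-processing tools that extract
    # structure, vocal tracks, and time-aligned lyrics.
    return SongFeatures(
        structure=["intro", "verse", "chorus"],
        lyrics=[(0.0, "first line"), (12.5, "second line")],
    )

def screenwriter_agent(features):
    # Drafts one scene description per musical section.
    return [f"Scene for {section}" for section in features.structure]

def director_agent(script):
    # Adds character and camera instructions to each scene.
    return [{"scene": s, "camera": "wide shot", "character": "lead singer"}
            for s in script]

def render_scene(shot):
    # Stand-in for calls to external image/video generators.
    return f"clip::{shot['scene']}"

def verifier_agent(clips):
    # Checks the generated clips; here, trivially accepts any
    # well-formed output. The real verifier judges coherence.
    return all(clip.startswith("clip::") for clip in clips)

def automv_pipeline(song_path):
    features = analyze_song(song_path)
    script = screenwriter_agent(features)
    shots = director_agent(script)
    clips = [render_scene(shot) for shot in shots]
    if not verifier_agent(clips):
        raise RuntimeError("Verifier rejected the generated clips")
    return clips

print(automv_pipeline("song.mp3"))
```

The key design point the sketch illustrates is the separation of concerns: each agent consumes the previous agent's structured output, and the verifier sits at the end as a quality gate rather than inside each step.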

Why it matters?

This research is important because it brings us closer to being able to automatically generate high-quality music videos. AutoMV performs significantly better than previous methods and is even approaching the quality of videos made by professionals, as judged by experts. It also explores using AI to *judge* music videos, which could help improve these systems even further, opening up possibilities for more accessible and creative music video production.

Abstract

Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and constructs these features as contextual inputs for the following agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV outperforms current baselines significantly across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.