
YingVideo-MV: Music-Driven Multi-Stage Video Generation

Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen

2025-12-03

Summary

This paper introduces a new system called YingVideo-MV that automatically creates realistic videos of musical performances from just the audio. It's designed to handle long videos with natural movements and consistent visuals, something previous systems struggled with.

What's the problem?

Existing methods for generating video from audio, such as those used to make avatars talk, were not well suited to longer, more complex videos of musical performances, particularly ones that need realistic camera movements. They offered little control over how the camera moved and struggled to stay visually consistent across the whole video, so the performance felt disjointed.

What's the solution?

The researchers developed YingVideo-MV, which works in stages. First, it analyzes the audio to understand the music. Then, a 'director' module plans out the shots and camera angles. Next, 'diffusion transformer' models generate the video frames, and a 'camera adapter' controls the camera's position. To keep long videos flowing smoothly from clip to clip, they also created a technique that adaptively adjusts the denoising window for each segment based on the audio. Finally, they built a large dataset of real-world music videos to train the system.
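The staged design described above can be sketched as a simple cascade: audio analysis feeds a shot planner, whose plan drives per-clip generation. This is a minimal illustrative sketch; every class and function name here (ShotPlan, analyze_audio, plan_shots, and so on) is hypothetical and stands in for the paper's actual modules.

```python
# Hypothetical sketch of a music-driven multi-stage pipeline.
# Names and logic are illustrative, not the authors' actual API.
from dataclasses import dataclass
from typing import List


@dataclass
class ShotPlan:
    start_sec: float
    end_sec: float
    camera_pose: str  # e.g. "dolly-in", "pan-left"


def analyze_audio(audio: List[float], sr: int) -> dict:
    """Stage 1: derive coarse semantics (duration, energy) from the waveform."""
    energy = sum(x * x for x in audio) / max(len(audio), 1)
    return {"duration_sec": len(audio) / sr, "energy": energy}


def plan_shots(semantics: dict, shot_len: float = 4.0) -> List[ShotPlan]:
    """Stage 2 (the 'director'): split the track into shots with camera poses."""
    poses = ["static", "dolly-in", "pan-left", "orbit"]
    shots, t = [], 0.0
    while t < semantics["duration_sec"]:
        end = min(t + shot_len, semantics["duration_sec"])
        shots.append(ShotPlan(t, end, poses[len(shots) % len(poses)]))
        t = end
    return shots


def generate_clip(shot: ShotPlan, semantics: dict) -> str:
    """Stage 3: stand-in for the diffusion transformer + camera adapter."""
    return f"clip[{shot.start_sec:.1f}-{shot.end_sec:.1f}s, {shot.camera_pose}]"


def make_music_video(audio: List[float], sr: int = 16000) -> List[str]:
    semantics = analyze_audio(audio, sr)
    return [generate_clip(s, semantics) for s in plan_shots(semantics)]


clips = make_music_video([0.0] * (16000 * 10))  # 10 seconds of audio
print(clips)  # three 4s-or-shorter clips covering the full track
```

The key design point the cascade captures is that camera motion is decided by an interpretable planner before generation, rather than emerging implicitly from the diffusion model.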

Why it matters?

This work is important because it opens the door to automatically creating high-quality music videos without needing a real camera crew or performers. This could be useful for musicians who want to easily create content, for generating visuals for music streaming services, or even for creating virtual concerts and performances. It represents a significant step forward in AI-powered video generation, specifically for the challenging task of music performance videos.

Abstract

While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embeddings. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .
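The time-aware dynamic window idea from the abstract can be illustrated with a toy mapping: calmer passages get a wider denoising overlap between clips (more shared context, smoother transitions), while high-energy passages get a narrower one. The specific linear mapping and the dynamic_window name below are assumptions for illustration, not the paper's actual formula.

```python
# Toy illustration of a time-aware dynamic denoising window.
# The energy-to-window mapping is an assumed stand-in, not the paper's method.
def dynamic_window(audio_energy: float,
                   min_frames: int = 4,
                   max_frames: int = 16) -> int:
    """Map normalized audio energy in [0, 1] to a denoising window size.

    Low energy (calm music) -> wide window for smooth clip transitions.
    High energy (fast music) -> narrow window to preserve sharp motion.
    """
    e = min(max(audio_energy, 0.0), 1.0)  # clamp to [0, 1]
    return round(max_frames - e * (max_frames - min_frames))


print(dynamic_window(0.0))  # 16: calm passage, widest window
print(dynamic_window(1.0))  # 4: energetic passage, narrowest window
print(dynamic_window(0.5))  # 10: intermediate energy
```

The point of conditioning the window on the audio rather than fixing it is that clip boundaries can then be blended more aggressively exactly where the music allows it.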