UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu
2025-09-09
Summary
This paper introduces UniVerse-1, a new AI model that can create both video and audio at the same time, making them work together seamlessly.
What's the problem?
Creating realistic audio and video together is difficult because it requires a lot of training data and because the sound must precisely match what's happening in the video. Existing methods often struggle with this coordination, and training a model from scratch is very resource-intensive.
What's the solution?
The researchers didn't start from zero. They took pre-trained models that were already good at making videos and music separately, and then combined them using a technique called 'stitching of experts'. They also built a system that automatically labels the training data with accurate sound and speech timings while training runs, avoiding the misalignment errors that come with relying on pre-written text descriptions. Finally, they fine-tuned this combined model on a large amount of audio-video data.
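The idea of pairing up corresponding blocks from two pre-trained experts can be sketched as follows. This is a toy illustration, not the paper's actual architecture: the class names, the per-block "weights", and the simple weighted-average fusion are all hypothetical stand-ins for the deep per-block fusion the paper describes.

```python
# Hedged sketch of a "stitching of experts" (SoE) style fusion.
# All names here are illustrative; the real model fuses diffusion-transformer
# blocks from pre-trained video and music experts in a more involved way.

class ExpertBlock:
    """Stand-in for one pre-trained transformer block (video or music expert)."""
    def __init__(self, scale):
        self.scale = scale  # pretend "weights": a per-block scaling factor

    def __call__(self, x):
        return [v * self.scale for v in x]

class StitchedBlock:
    """Couples the corresponding video and audio expert blocks."""
    def __init__(self, video_block, audio_block):
        self.video_block = video_block
        self.audio_block = audio_block

    def __call__(self, video_x, audio_x):
        v = self.video_block(video_x)
        a = self.audio_block(audio_x)
        # Toy cross-modal exchange: each stream is nudged toward the other,
        # standing in for the paper's deep per-block fusion.
        fused_v = [0.9 * vi + 0.1 * ai for vi, ai in zip(v, a)]
        fused_a = [0.9 * ai + 0.1 * vi for vi, ai in zip(v, a)]
        return fused_v, fused_a

def stitch_experts(video_blocks, audio_blocks):
    """Pair corresponding blocks from the two pre-trained experts."""
    assert len(video_blocks) == len(audio_blocks)
    return [StitchedBlock(v, a) for v, a in zip(video_blocks, audio_blocks)]

# Build two tiny "experts" and stitch them block by block.
video_expert = [ExpertBlock(1.0) for _ in range(4)]
audio_expert = [ExpertBlock(0.5) for _ in range(4)]
model = stitch_experts(video_expert, audio_expert)

v, a = [1.0, 2.0], [1.0, 2.0]
for block in model:
    v, a = block(v, a)
```

The key design point this sketch preserves is that neither expert is retrained from scratch: both keep their pre-trained block structure, and only the per-block coupling is new.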
Why it matters?
This work is important because it pushes the boundaries of AI's ability to generate realistic and synchronized audio-visual content. By releasing their model and a new testing dataset, they hope to help other researchers improve this technology and close the gap with leading models like Veo3, ultimately leading to better AI-generated videos with convincing sound.
Abstract
We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment of both ambient sounds and speech with the video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being fine-tuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visual output for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.
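The "online annotation" idea from the abstract, i.e. generating labels during training rather than relying on precomputed text annotations, can be sketched as a data loader that labels each clip at load time. All names here are hypothetical; the actual pipeline runs real annotation models (e.g., for speech and ambient-sound timing), not the placeholder below.

```python
# Hedged sketch of an online annotation loop: labels are produced while
# training data is being loaded, instead of being precomputed offline.
# `annotate` is a placeholder for the paper's actual annotation models.

def annotate(clip):
    """Stand-in for annotation models (speech timing, sound tagging, etc.)."""
    return {"speech_spans": [(0.0, clip["duration"])], "ambient": "unknown"}

def online_batches(clips, batch_size):
    """Yield training batches whose labels are generated on the fly."""
    batch = []
    for clip in clips:
        batch.append((clip, annotate(clip)))  # label at load time
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

clips = [{"id": i, "duration": 5.0} for i in range(5)]
batches = list(online_batches(clips, batch_size=2))
```

Because annotation happens inside the data path, updating the annotation models immediately changes the labels the trainer sees, which is one way the pipeline can avoid stale or misaligned text annotations.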