UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang

2025-11-06

Summary

This paper introduces a new system called UniAVGen that's designed to create both audio and video together, aiming for better quality and realism than existing methods.

What's the problem?

Currently, when computers try to generate videos with matching sound, they often struggle to sync the lip movements with the speech, and the audio and video don't always make sense together. Existing open-source methods don't effectively connect the audio and video information during generation, which leads to poor lip sync and inconsistent meaning.

What's the solution?

UniAVGen solves this with a dual-branch design: two models working in parallel, both based on a technology called Diffusion Transformers (DiTs). The key is a new way for the audio and video branches to 'talk' to each other, called Asymmetric Cross-Modal Interaction – a two-way street where each branch constantly checks the other to stay aligned in time and meaning. A Face-Aware Modulation module also focuses this interaction on the face, so mouth movements are prioritized when matching the video to the sound. Finally, a technique called Modality-Aware Classifier-Free Guidance strengthens the connection between audio and video during the final generation process.
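To make the 'two-way street' idea concrete, here is a minimal sketch of bidirectional cross-attention between two token sequences, one per modality. This is a toy illustration, not the paper's implementation: it omits the learned query/key/value projections, multiple heads, temporal alignment, and the face-aware weighting that UniAVGen adds on top.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    # tokens from one modality (queries) attend to tokens
    # from the other modality (context)
    scores = queries @ context.T / np.sqrt(d_k)
    return softmax(scores) @ context

# toy latent sequences: 8 video tokens and 12 audio tokens, dim 16
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
audio = rng.normal(size=(12, 16))

# bidirectional interaction: each branch reads from the other,
# with a residual connection so each keeps its own information
video_out = video + cross_attention(video, audio, d_k=16)
audio_out = audio + cross_attention(audio, video, d_k=16)
```

Note that each branch's output keeps its own sequence length and dimension; only the information content is mixed across modalities.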

Why it matters?

This research is important because it produces more realistic and believable audio-video content with far less training data than previous approaches (1.3 million samples versus 30.1 million). It also unifies several tasks within a single model: generating audio and video jointly, creating videos from audio, adding sound to existing videos, and continuing a video with matching audio.

Abstract

Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
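The abstract names Modality-Aware Classifier-Free Guidance but does not spell out its formula. One plausible sketch, assuming a two-term decomposition in the style of multi-condition classifier-free guidance (the function name, arguments, and scales below are all hypothetical, not taken from the paper):

```python
import numpy as np

def modality_aware_cfg(eps_uncond, eps_cond, eps_full, s_cond=5.0, s_cross=2.0):
    """Hypothetical two-term guidance.

    eps_uncond: noise prediction with all conditions dropped
    eps_cond:   prediction with the branch's own condition, cross-modal
                context dropped
    eps_full:   prediction with the cross-modal context included
    s_cross amplifies the cross-modal correlation signal separately
    from the ordinary condition scale s_cond.
    """
    return (eps_uncond
            + s_cond * (eps_cond - eps_uncond)
            + s_cross * (eps_full - eps_cond))

# toy noise predictions for a 4-dim latent
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)
eps_full = np.full(4, 2.0)

guided = modality_aware_cfg(eps_uncond, eps_cond, eps_full,
                            s_cond=1.0, s_cross=1.0)
# with both scales at 1.0 the terms telescope back to eps_full
```

Setting s_cross above 1.0 pushes the prediction further in the direction contributed by the other modality, which is one way to "explicitly amplify cross-modal correlation signals" as the abstract describes.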