Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
2025-11-27
Summary
This paper focuses on making AI better at generating video and sound together so that the two stay precisely in sync, something current AI models struggle with.
What's the problem?
When AI tries to generate audio and video at the same time, the two often drift out of sync. This happens for a few key reasons: the audio and video 'learn' separately and drift apart during generation, the AI has trouble focusing on the precise timing cues needed for good synchronization, and the standard guidance technique strengthens how well each modality follows the prompt *without* doing anything to keep the audio and video aligned with each other.
What's the solution?
The researchers developed a new system called Harmony. It works in three main ways: first, it trains the AI to create video from audio *and* audio from video, so the two modalities learn to stay connected. Second, it uses a special module that efficiently attends to both broad, global style and fine-grained, local timing details. Finally, it modifies the guidance process used at generation time to specifically amplify audio-visual synchronization.
Why does it matter?
This research is important because it significantly improves the quality of AI-generated videos with sound. Better synchronization makes the videos more realistic and enjoyable to watch, and it pushes the field of generative AI forward by solving a major technical challenge.
Abstract
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models in particular struggling to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental problems of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.