Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

2024-12-23

Summary

This paper introduces MMAudio, a system that generates high-quality audio to match video content and can also take text descriptions into account. It uses a multimodal joint training approach to keep the generated audio synchronized with the video.

What's the problem?

Generating audio that fits well with video is difficult, especially when the audio needs to be both high quality and tightly synchronized with the visuals. Prior approaches typically train only on limited paired video-audio data, which restricts both the quality of the generated audio and how well it matches the scene.

What's the solution?

MMAudio addresses this with a multimodal joint training framework that trains on paired video-audio data together with much larger, readily available text-audio datasets. This lets the model learn to produce audio that is both high quality and semantically aligned with the video content. It also includes a conditional synchronization module that aligns the video conditions with the audio at the frame level, so the generated sound lines up with on-screen events.
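To make the joint-training idea more concrete, here is a minimal PyTorch sketch of one way a single conditioning module could accept both kinds of training data, substituting a learned placeholder whenever the video stream is absent (as it is for text-audio batches). The module names, feature dimensions, and placeholder-token trick are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class JointConditioner(nn.Module):
    """Illustrative sketch: merge text and (optional) video conditions.

    When a batch comes from text-audio data (no video), a learned
    "empty video" token stands in, so the same network can train on both
    paired video-audio data and larger-scale text-audio data.
    """

    def __init__(self, dim=512, video_frames=8):
        super().__init__()
        self.video_proj = nn.Linear(768, dim)   # assumed per-frame visual features
        self.text_proj = nn.Linear(768, dim)    # assumed text-encoder features
        self.empty_video = nn.Parameter(torch.zeros(1, video_frames, dim))

    def forward(self, text_feats, video_feats=None):
        text_cond = self.text_proj(text_feats)            # (B, T_text, dim)
        if video_feats is None:                           # text-audio batch
            video_cond = self.empty_video.expand(text_cond.size(0), -1, -1)
        else:                                             # video-audio batch
            video_cond = self.video_proj(video_feats)     # (B, T_video, dim)
        return torch.cat([video_cond, text_cond], dim=1)  # joint condition


# Usage: a text-audio batch omits video_feats; a video-audio batch provides them.
cond = JointConditioner()
text = torch.randn(4, 16, 768)
video = torch.randn(4, 8, 768)
print(cond(text).shape, cond(text, video).shape)
```

The point of the sketch is only the data-flow: both data sources drive the same generator, so the scarce video-audio pairs no longer bound what the model can learn about audio quality and semantics.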

Why it matters?

This research is important because it enhances how AI can generate audio for videos, making it useful for filmmakers, content creators, and anyone looking to add sound to their visual media. By improving synchronization and quality, MMAudio can help create more engaging and immersive experiences for viewers.

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
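The abstract names flow matching as the training objective. Below is a minimal sketch of the standard (rectified-flow style) conditional flow matching loss for reference; `model` is a placeholder network and the tensor shapes are assumptions, so this illustrates the general objective rather than reproducing MMAudio's code:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, audio_latents, condition):
    """Standard conditional flow-matching loss (rectified-flow form).

    The network is trained to predict the constant velocity (x1 - x0)
    along the straight path x_t = (1 - t) * x0 + t * x1 from noise to data.
    `model` is a hypothetical stand-in for the conditional generator.
    """
    x1 = audio_latents                                    # data sample
    x0 = torch.randn_like(x1)                             # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)    # per-sample time in [0, 1)
    x_t = (1 - t) * x0 + t * x1                           # point on the straight path
    target_velocity = x1 - x0                             # ground-truth velocity field
    pred_velocity = model(x_t, t.flatten(), condition)
    return F.mse_loss(pred_velocity, target_velocity)


# Usage with a dummy velocity network (hypothetical placeholder for the real model):
dummy = lambda x, t, c: x
latents = torch.randn(4, 32, 64)   # assumed (batch, time, channels) audio latents
print(flow_matching_loss(dummy, latents, condition=None).item())
```

At inference time, flow matching models generate samples by integrating the learned velocity field from noise toward data with an ODE solver, which is consistent with the short generation time reported in the abstract.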