
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Qi Yang, Binjie Mao, Zili Wang, Xing Nie, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang

2024-09-11


Summary

This paper introduces Draw an Audio, a new system that automatically generates audio for videos, enhancing the sound experience in films by synchronizing sound effects with the visuals.

What's the problem?

Creating sound effects for videos (known as Foley) is challenging because the audio must match the actions shown in the video. This includes keeping the sounds consistent with what is happening on screen and giving them the right volume and timing. Existing methods struggle with these requirements, making it hard to produce high-quality audio that fits a video well.

What's the solution?

To address these problems, the authors developed Draw an Audio, which lets users provide multiple instructions through drawn masks and loudness signals. The system uses a Mask-Attention Module (MAM) to focus on the parts of the video a user marks as important, and a Time-Loudness Module (TLM) to make the generated sound follow the video's volume and timing. They also built an extended dataset, VGGSound-Caption, by adding caption annotations to help train the model effectively. Extensive testing showed that Draw an Audio generates better-synchronized audio for videos than previous methods.
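As a rough illustration of how a drawn mask and a loudness curve could steer audio generation, here is a minimal PyTorch-style sketch. The names (MaskedCrossAttention, LoudnessConditioner, region_mask, loudness_curve) and the wiring are illustrative assumptions, not the paper's actual MAM or TLM code; in the real system these ideas sit inside a full audio-generation model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCrossAttention(nn.Module):
    """Hypothetical sketch of mask-guided attention: audio queries attend
    only to video tokens inside a user-drawn region mask. This illustrates
    the idea behind MAM; it is not the paper's actual implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio_queries, video_tokens, region_mask):
        # audio_queries: (B, T_audio, dim), video_tokens: (B, N_video, dim)
        # region_mask:   (B, N_video), 1 for tokens inside the drawn region
        key_padding_mask = region_mask == 0  # True = ignore this video token
        out, _ = self.attn(audio_queries, video_tokens, video_tokens,
                           key_padding_mask=key_padding_mask)
        return out


class LoudnessConditioner(nn.Module):
    """Hypothetical sketch of time-loudness conditioning: a 1-D loudness
    curve is resampled to the audio-latent length and added as a per-step
    bias, nudging the generator toward louder output at the matching time."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(1, dim)

    def forward(self, audio_latents, loudness_curve):
        # audio_latents: (B, T_latent, dim), loudness_curve: (B, T_signal)
        curve = F.interpolate(loudness_curve.unsqueeze(1),
                              size=audio_latents.shape[1],
                              mode="linear", align_corners=False)
        return audio_latents + self.proj(curve.transpose(1, 2))


# Toy usage with random tensors
if __name__ == "__main__":
    B, T_audio, N_video, dim = 2, 16, 32, 64
    mam = MaskedCrossAttention(dim)
    tlm = LoudnessConditioner(dim)
    audio = torch.randn(B, T_audio, dim)
    video = torch.randn(B, N_video, dim)
    mask = torch.ones(B, N_video)
    mask[:, N_video // 2:] = 0       # pretend the user drew over half the frame
    loudness = torch.rand(B, 100)    # user-supplied loudness curve over time
    fused = mam(audio, video, mask)
    conditioned = tlm(fused, loudness)
    print(conditioned.shape)  # torch.Size([2, 16, 64])
```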

Why it matters?

This research is important because it improves how sound effects are created for films and videos, making them more immersive and enjoyable for viewers. By automating this process, filmmakers can save time and resources while enhancing the overall quality of their productions.

Abstract

Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges encompass maintaining the content consistency between the input video and the generated audio, as well as the alignment of temporal and loudness properties within the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and target video, we introduce the Mask-Attention Module (MAM), which employs masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure the synthesis of sound that aligns with the video in both loudness and temporal dimensions. Furthermore, we have extended a large-scale V2A dataset, named VGGSound-Caption, by annotating caption prompts. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves the state of the art. Project page: https://yannqi.github.io/Draw-an-Audio/.