Video-Guided Foley Sound Generation with Multimodal Controls

Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon

2024-11-28

Summary

This paper introduces MultiFoley, a new system for generating sound effects for videos that allows users to create and customize sounds using text, audio, and video inputs.

What's the problem?

Creating sound effects for videos is challenging because it often calls for artistic sounds that diverge from what the real-world source would actually produce. Existing methods may not give sound designers enough control or flexibility to produce the right sounds for different scenes.

What's the solution?

MultiFoley addresses this problem by letting users input a silent video along with a text prompt to generate the desired sounds. For example, users can create clean sounds, like skateboard wheels spinning without background noise, or whimsical sounds, like making a lion's roar sound like a cat's meow. Users can also condition generation on reference audio from sound effects (SFX) libraries or on partial videos. The system is trained jointly on low-quality internet videos and high-quality professional SFX recordings, enabling it to produce high-quality, full-bandwidth (48 kHz) audio that matches the video content.
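The multimodal conditioning described above could be sketched roughly as follows. This is a minimal illustrative sketch; the class and field names are hypothetical and do not reflect the authors' actual implementation or API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FoleyConditioning:
    """Hypothetical container for MultiFoley-style conditioning inputs."""
    video_path: str                        # silent input video (always required)
    text_prompt: Optional[str] = None      # e.g. "skateboard wheels, no wind noise"
    reference_audio: Optional[str] = None  # clip from an SFX library
    partial_audio: Optional[str] = None    # audio covering only part of the video

    def modalities(self) -> list[str]:
        """Return which conditioning signals are active; video is always included."""
        active = ["video"]
        if self.text_prompt:
            active.append("text")
        if self.reference_audio or self.partial_audio:
            active.append("audio")
        return active

# Example: turn a lion's roar into a cat's meow via a text prompt.
cond = FoleyConditioning(video_path="lion.mp4", text_prompt="a cat's meow")
print(cond.modalities())
```

In this sketch, the generator would receive the video features plus whichever optional signals are set, which mirrors the paper's described flexibility of mixing text, audio, and video controls.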

Why it matters?

This research matters because it improves how sound designers create audio for videos, making it easier to generate customized sound effects that fit specific scenes. By integrating multiple input types, MultiFoley opens up new possibilities for creative expression in film and media production.

Abstract

Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/