FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen
2024-07-03

Summary
This paper introduces FoleyCrafter, a new system that automatically adds realistic sound effects to silent videos, making them more engaging and immersive by ensuring the sounds closely match the visuals.
What's the problem?
The main problem is that existing methods for adding sound effects to videos often struggle to create high-quality audio that is both relevant to the video's content and synchronized with the action on screen. This can lead to sounds that don't match what viewers see, which can ruin the experience.
What's the solution?
To solve this issue, the authors developed FoleyCrafter, which builds on a pre-trained text-to-audio model. It has two main parts: a semantic adapter that makes sure the sounds are appropriate for what’s happening in the video, and a temporal controller that ensures the sounds happen at the right time. This means that if a dog barks in the video, the bark plays exactly when the barking happens on screen. Additionally, users can provide text descriptions to guide what sounds should be added, allowing for more control over the final audio.
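The semantic adapter described above conditions a pre-trained text-to-audio model on video features through cross-attention layers that run in parallel with the model's existing text cross-attention. The sketch below is a hypothetical illustration of that idea (the class name, dimensions, and the zero-initialized gating scale are assumptions for clarity, not the authors' actual implementation):

```python
import torch
import torch.nn as nn

class ParallelCrossAttentionAdapter(nn.Module):
    """Illustrative sketch of a semantic adapter: a trainable cross-attention
    branch over video features, summed with the output of the (frozen) text
    cross-attention so the base text-to-audio model is left intact."""

    def __init__(self, dim: int, video_dim: int, num_heads: int = 8):
        super().__init__()
        # Project video features into the audio model's hidden dimension.
        self.video_proj = nn.Linear(video_dim, dim)
        self.video_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: at the start of training the adapter
        # contributes nothing, so the pre-trained model's behavior is preserved.
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, audio_latents, text_attn_out, video_feats):
        # audio_latents: (batch, audio_tokens, dim) queries from the UNet
        # text_attn_out: (batch, audio_tokens, dim) frozen text cross-attention output
        # video_feats:   (batch, video_tokens, video_dim) per-frame visual features
        v = self.video_proj(video_feats)
        video_attn_out, _ = self.video_attn(audio_latents, v, v)
        # Parallel branches are summed; the learned scale gates the video signal.
        return text_attn_out + self.scale * video_attn_out
```

The zero-initialized scale is a common adapter trick: training starts from the unmodified pre-trained model and gradually blends in the video conditioning.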
Why it matters?
This research is important because it enhances how we create and experience videos. By improving the way sound effects are added, FoleyCrafter helps make videos more realistic and enjoyable, which is valuable for filmmakers, game developers, and anyone working with visual media.
Abstract
We study Neural Foley, the automatic generation of high-quality sound effects synchronized with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: the semantic adapter for semantic alignment and the temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and code are available at https://github.com/open-mmlab/FoleyCrafter.
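The temporal controller pairs an onset detector (which predicts *when* sound events occur from the video) with a timestamp-based adapter that conditions generation on those predicted timestamps. As a rough, hypothetical illustration of the interface between the two, the snippet below thresholds per-frame onset scores into a binary timestamp mask and gates audio latents with it (the function names and the simple thresholding are assumptions, not the paper's actual detector):

```python
import torch

def onset_scores_to_mask(onset_scores: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Convert per-frame onset probabilities (frames,) into a binary
    timestamp mask: 1 where a sound event should occur, 0 elsewhere."""
    return (onset_scores > threshold).float()

def gate_latents_with_timestamps(audio_latents: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Illustrative timestamp conditioning: broadcast the per-frame mask over
    latent channels so audio energy is steered toward the predicted onsets.
    audio_latents: (frames, channels); mask: (frames,)."""
    return audio_latents * mask.unsqueeze(-1)
```

In the actual system the mask would come from a learned onset detector over video frames and feed a trained adapter rather than a hard multiplicative gate; this sketch only shows the data flow.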