Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Ciara Rowles, Varun Jampani, Simon Donné, Shimon Vainer, Julian Parker, Zach Evans

2025-10-27

Summary

This paper introduces a new technique called Foley Control, a way to automatically create sound effects (Foley) for videos using artificial intelligence. It is designed to be efficient and does not require retraining large AI models from scratch.

What's the problem?

Creating realistic sound effects that perfectly match a video is difficult and often requires a lot of manual work. Existing AI systems that attempt this often need to be completely retrained whenever a component changes, such as swapping in a different sound generator. They can also be very computationally expensive, demanding a lot of processing power and memory.

What's the solution?

Foley Control solves this by keeping the pretrained video and audio models separate and frozen. It then trains only a small 'bridge' between them using a mechanism called cross-attention. This bridge learns how video and audio relate, letting the video influence the timing and fine details of the sound effects while text prompts still control the overall sound. Importantly, the video information is pooled down to a smaller set of tokens before conditioning, which saves memory and makes training more stable.
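To make the bridge idea concrete, here is a minimal PyTorch sketch of one transformer block in this style: a frozen text cross-attention layer followed by a small trainable video cross-attention layer, with the video tokens pooled first. All names, dimensions, and module choices here are illustrative assumptions, not the paper's actual implementation (which builds on the Stable Audio Open DiT and V-JEPA2 embeddings).

```python
import torch
import torch.nn as nn

class VideoBridgeBlock(nn.Module):
    """Hypothetical sketch: a DiT-style block where a trainable video
    cross-attention 'bridge' is inserted after the frozen text
    cross-attention. Self-attention/MLP sublayers are omitted for brevity."""
    def __init__(self, dim=512, n_heads=8, n_video_tokens=16):
        super().__init__()
        # Frozen text cross-attention from the pretrained T2A backbone.
        self.text_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        for p in self.text_attn.parameters():
            p.requires_grad = False
        # Trainable bridge: compact video cross-attention added afterwards.
        self.video_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Pool the many video tokens down to a small fixed set to cut memory.
        self.pool = nn.AdaptiveAvgPool1d(n_video_tokens)

    def forward(self, audio_tokens, text_tokens, video_tokens):
        # Text prompt sets global semantics via the frozen cross-attention.
        h, _ = self.text_attn(audio_tokens, text_tokens, text_tokens)
        audio_tokens = audio_tokens + h
        # Pool video tokens: (B, T, D) -> (B, n_video_tokens, D).
        v = self.pool(video_tokens.transpose(1, 2)).transpose(1, 2)
        # Video refines timing and local dynamics via the trainable bridge.
        h, _ = self.video_attn(audio_tokens, v, v)
        return audio_tokens + h

block = VideoBridgeBlock()
audio = torch.randn(2, 100, 512)   # latent audio tokens
text = torch.randn(2, 20, 512)     # text-encoder tokens
video = torch.randn(2, 256, 512)   # projected video embeddings
out = block(audio, text, video)
print(out.shape)  # torch.Size([2, 100, 512])
```

Because only `video_attn` has trainable parameters, the number of weights updated during training is a small fraction of the full model, which is the efficiency argument the paper makes.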

Why it matters?

This method is important because it makes video sound effect generation much more practical. It's faster, uses fewer resources, and allows for easy swapping of different components without needing to start over with training. This could be a big step towards making high-quality, automatically generated sound effects accessible for more video projects, and the core idea could even be applied to other types of audio generation like speech.

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).