AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi

2026-03-28

Summary

This paper introduces AVControl, a new way to control video and audio generation using signals such as depth maps, body poses, camera movements, and sound. The goal is to make it easier to tell a model *exactly* what the generated video or audio should look and sound like.

What's the problem?

Currently, generating videos and audio under specific controls is difficult. Existing methods either train one huge, monolithic model for a fixed set of controls, or require significant changes to the model's architecture every time a new control, such as depth or camera angle, is added. This is inefficient and scales poorly: each additional control option becomes harder and more expensive to support.

What's the solution?

The researchers developed a system called AVControl. It builds on LTX-2, an existing joint audio-visual foundation model, and uses LoRA (Low-Rank Adaptation). Instead of modifying the main model, they train a small, separate adapter module for each control. These modules are trained independently and communicate with the main model through a 'parallel canvas': the control signal is encoded as additional tokens that the model's attention layers can attend to. This means new controls can be added without altering the core model, making the system much more flexible.
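The 'parallel canvas' idea can be illustrated with a toy attention function. This is a minimal sketch, not the paper's implementation: the function names, token shapes, and the choice to reuse control tokens as both keys and values are illustrative assumptions. The point it shows is that control tokens are appended only on the key/value side, so the output still has one vector per video token and the base model's token layout is unchanged.

```python
import numpy as np

def attention(q, k, v):
    # Plain scaled dot-product attention over token sequences.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_with_canvas(q, k, v, canvas_k, canvas_v):
    # "Parallel canvas" (sketch): control tokens join only as extra
    # keys/values, so video tokens can attend to the control signal
    # without changing the number or layout of the video tokens.
    k_ext = np.concatenate([k, canvas_k], axis=0)
    v_ext = np.concatenate([v, canvas_v], axis=0)
    return attention(q, k_ext, v_ext)

rng = np.random.default_rng(0)
d = 8
video_tokens = rng.normal(size=(16, d))    # hypothetical video latent tokens
control_tokens = rng.normal(size=(4, d))   # e.g. encoded depth-map tokens

out = attention_with_canvas(video_tokens, video_tokens, video_tokens,
                            control_tokens, control_tokens)
print(out.shape)  # (16, 8): one output vector per video token
```

Because the extra tokens only enter through attention, a base model with no canvas and a model with one can share the same weights everywhere else, which is what lets the adapters stay small and independent.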

Why it matters?

This work is important because it makes controlling video and audio generation much more practical. It's more efficient in terms of computing power and the amount of data needed to train the system. It also allows for a wider range of controls to be used simultaneously, opening up possibilities for more creative and precise video and audio creation. Plus, they've made their code and trained modules publicly available, so others can build upon their work.

Abstract

Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
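The abstract's claim that each control modality is "trained as a separate LoRA" can be sketched as follows. This is a simplified numpy illustration of the general LoRA idea, not the paper's code: the dimensions, initialization scale, and adapter names (`depth_lora`, `pose_lora`) are made up for the demo, and real LoRA initializes one factor to zero before training. It shows why adapters are swappable: the frozen base weight `W` is shared, and each modality only contributes a low-rank additive path.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, rank = 32, 32, 4

W = rng.normal(size=(d_out, d_in))  # frozen base weight, shared by all controls

def make_lora(rank):
    # Low-rank factors A, B are the only trained parameters per modality.
    # (Toy nonzero init for the demo; real LoRA starts B at zero.)
    A = rng.normal(size=(rank, d_in)) * 0.1
    B = rng.normal(size=(d_out, rank)) * 0.1
    return A, B

def forward(x, lora=None):
    y = W @ x
    if lora is not None:
        A, B = lora
        y = y + B @ (A @ x)  # adapter path added on top of the frozen base
    return y

depth_lora = make_lora(rank)  # hypothetical per-modality adapters
pose_lora = make_lora(rank)

x = rng.normal(size=(d_in,))
base = forward(x)                  # base model behavior, no control
with_depth = forward(x, depth_lora)  # same base weights, depth adapter active
```

Each adapter stores only `rank * (d_in + d_out)` extra parameters instead of a full `d_out * d_in` matrix, which is why, per the abstract, each modality needs only a small dataset and a few hundred to a few thousand training steps.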