ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin
2025-12-03
Summary
This paper focuses on creating realistic 3D sound from videos, specifically generating binaural audio, which simulates how we naturally hear sounds in space.
What's the problem?
Currently, most systems that create sound from videos produce only a single audio channel, lacking a sense of direction or immersion. Existing methods that *do* try to create 3D sound do so in two steps: first making a regular mono audio track, and then adding spatial effects. This two-step process often leads to errors and inconsistencies, making the sound feel unnatural or disconnected from the video.
What's the solution?
The researchers introduced a new method called ViSAudio that directly generates binaural 3D audio from video in a single step. They also created a large dataset, BiAudio, containing nearly 100,000 videos paired with corresponding binaural audio recordings. ViSAudio uses a technique called 'conditional flow matching' with two dedicated generation branches (one per audio channel), plus a conditional spacetime module that keeps the left and right channels consistent with each other while preserving the spatial cues that make sounds appear to come from specific locations in the video.
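To give a rough intuition for conditional flow matching: the model learns to predict the velocity that moves a noise sample toward a real audio latent along a straight-line path. The sketch below is a heavily simplified, toy illustration of that training target, not the paper's actual implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Toy conditional-flow-matching training example (illustrative only).

    x0: noise sample; x1: target latent (in ViSAudio this would be a
    binaural audio latent, conditioned on video features -- an assumption
    of this sketch); t: time in [0, 1].
    """
    xt = (1.0 - t) * x0 + t * x1  # point on the linear interpolation path
    v_target = x1 - x0            # constant velocity the model regresses
    return xt, v_target

rng = np.random.default_rng(0)
# toy "binaural latent": two channels (left/right), 4 dims each
x1 = rng.standard_normal((2, 4))
x0 = rng.standard_normal((2, 4))
t = rng.uniform()
xt, v = cfm_pair(x0, x1, t)
# a network would be trained so that model(xt, t, video_features) ≈ v;
# sampling then integrates that velocity field from noise (t=0) to audio (t=1)
```

In the actual paper, two such flows (one per ear) are modeled by dedicated branches, with the conditional spacetime module coupling them so the channels stay mutually consistent.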
Why it matters?
This work is important because it significantly improves the realism of audio generated from videos. By creating truly immersive 3D sound that accurately reflects the video's content and camera movements, it has the potential to enhance experiences in virtual reality, gaming, and other applications where realistic sound is crucial.
Abstract
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.