AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
2024-06-17
Summary
This paper introduces a new model called Audio-Visual Gaussian Splatting (AV-GS) designed to improve how binaural audio (3D sound) is generated from a single audio source in a 3D environment. It focuses on creating more realistic sound experiences by considering the materials and geometry of the environment.
What's the problem?
Current methods for generating binaural audio rely on complex implicit models (such as NeRF) that are slow to render. They also struggle to represent the entire environment, including how sound interacts with different materials (like walls or furniture) and the overall layout of the space. As a result, the synthesized audio may sound unrealistic or fail to change appropriately with the listener's position in the room.
What's the solution?
To solve these problems, the authors developed the AV-GS model, which learns to represent a 3D scene using points that capture both material properties and geometry. They use a technique that adjusts how these points are distributed based on each point's importance for sound propagation. For example, they place more points on surfaces that divert sound paths, such as texture-less walls, to create a more accurate audio experience. This allows the audio to adapt to different viewpoints, making it more realistic.
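The densification-and-pruning idea can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the function name `densify_and_prune`, the thresholds, and the use of a simple scalar "contribution" score per point are all assumptions made for the example.

```python
import numpy as np

def densify_and_prune(points, contributions, clone_thresh=0.7,
                      prune_thresh=0.05, rng=None):
    """Redistribute scene points by a (hypothetical) per-point
    contribution to sound propagation: drop near-zero contributors,
    clone high contributors with a small positional jitter so that
    acoustically important regions (e.g., reflective walls) get
    denser coverage.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Prune points that barely affect sound propagation.
    keep = contributions > prune_thresh
    points, contributions = points[keep], contributions[keep]
    # Densify: duplicate important points with slight jitter.
    clones = points[contributions > clone_thresh]
    jitter = rng.normal(scale=0.01, size=clones.shape)
    return np.vstack([points, clones + jitter])
```

With four points whose contributions are `[0.9, 0.5, 0.01, 0.8]`, the third point is pruned and the first and fourth are cloned, leaving five points, so the point budget shifts toward regions that matter acoustically.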
Why it matters?
This research is important because it enhances the way we can create and experience sound in virtual environments. By improving how binaural audio is synthesized, AV-GS can lead to better applications in areas like virtual reality, gaming, and film, where realistic sound is crucial for immersion and user experience.
Abstract
Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models that exploit visual cues as a condition for synthesizing binaural audio. However, in addition to the low efficiency stemming from heavy NeRF rendering, these methods all have a limited ability to characterize the entire scene environment, including room geometry, material properties, and the spatial relation between the listener and the sound source. To address these issues, we propose a novel Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with an audio-guidance parameter on locally initialized Gaussian points, taking into account the spatial relation between the listener and the sound source. To make the visual scene model audio-adaptive, we propose a point densification and pruning strategy that optimally distributes the Gaussian points according to each point's contribution to sound propagation (e.g., more points are needed on texture-less wall surfaces, since they divert the sound path). Extensive experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAVS and simulation-based SoundSpaces datasets.