SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen

2025-08-04

Summary

This paper introduces SpA2V, a new method that generates realistic videos from audio alone, building on the idea that sounds carry spatial information such as where they come from and how they move.

What's the problem?

Most current audio-to-video methods only recognize what sounds are present and ignore the location and movement information the audio carries, which is needed to generate videos that correctly match the audio in space.

What's the solution?

SpA2V addresses this in two stages: it first analyzes the audio to determine which objects are making the sounds and where they are located in the scene, then builds a video scene layout that captures this spatial information and uses it to guide a video diffusion model, producing videos that match both the content of the sound and its spatial properties.
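To make the two-stage idea concrete, here is a minimal Python sketch of the overall flow: derive a rough left/right position for the sound source from the audio, turn that into a per-frame layout, and rasterize the layout into a conditioning signal that a layout-conditioned video generator could consume. This is not the authors' implementation; all names (LayoutBox, estimate_sounding_objects, layout_to_condition) are hypothetical, and the inter-channel energy cue stands in for the much richer spatial and semantic reasoning described in the paper.

```python
# Hypothetical sketch of an audio -> layout -> video-condition pipeline.
# Not the SpA2V codebase; names and heuristics are illustrative only.

from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class LayoutBox:
    """Placement of one sounding object in a single frame (normalized coords)."""
    label: str   # semantic class of the sound source, e.g. "dog"
    frame: int   # frame index
    x: float     # horizontal center in [0, 1], inferred from the audio direction
    y: float     # vertical center in [0, 1]
    w: float     # box width
    h: float     # box height


def estimate_sounding_objects(stereo_audio: np.ndarray, n_frames: int) -> List[LayoutBox]:
    """Stage 1 stand-in: infer WHAT is sounding and WHERE it sits per frame.

    Only the inter-channel level difference is used as a crude left/right cue;
    the actual method extracts richer spatial auditory cues and object labels.
    """
    left, right = stereo_audio[0], stereo_audio[1]
    hop = len(left) // n_frames
    boxes = []
    for t in range(n_frames):
        seg_l = left[t * hop:(t + 1) * hop]
        seg_r = right[t * hop:(t + 1) * hop]
        # Louder right channel -> place the object toward the right of the frame.
        e_l = np.sum(seg_l ** 2) + 1e-8
        e_r = np.sum(seg_r ** 2) + 1e-8
        x = float(e_r / (e_l + e_r))
        boxes.append(LayoutBox(label="sound_source", frame=t, x=x, y=0.5, w=0.2, h=0.2))
    return boxes


def layout_to_condition(boxes: List[LayoutBox], n_frames: int, size: int = 64) -> np.ndarray:
    """Stage 2 input: rasterize the layout into per-frame masks that a
    layout-conditioned video diffusion model could take as guidance."""
    cond = np.zeros((n_frames, size, size), dtype=np.float32)
    for b in boxes:
        x0 = int((b.x - b.w / 2) * size)
        x1 = int((b.x + b.w / 2) * size)
        y0 = int((b.y - b.h / 2) * size)
        y1 = int((b.y + b.h / 2) * size)
        cond[b.frame, max(y0, 0):y1, max(x0, 0):x1] = 1.0
    return cond


if __name__ == "__main__":
    sr, seconds, n_frames = 16000, 2, 16
    # Synthetic stereo clip whose energy drifts from the left to the right channel.
    t = np.linspace(0, 1, sr * seconds)
    tone = np.sin(2 * np.pi * 440 * t)
    audio = np.stack([tone * (1 - t), tone * t])  # shape (2, samples)

    boxes = estimate_sounding_objects(audio, n_frames)
    cond = layout_to_condition(boxes, n_frames)
    print("layout condition shape:", cond.shape)  # (16, 64, 64)
    # In the paper, such a layout guides a pretrained video diffusion model.
```

Running the sketch produces a stack of per-frame masks whose active region drifts from left to right, mirroring how the sound source moves in the synthetic clip; the real system would pass an analogous layout to its video generator.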

Why it matters?

This matters because it produces videos that align more faithfully with their soundtracks, making AI systems more realistic and useful in applications such as virtual reality, video production, and tools that help people visualize sound.

Abstract

SpA2V generates realistic videos aligned with input audio by leveraging spatial auditory cues and integrating them into diffusion models through video scene layouts.