Temporally Aligned Audio for Video with Autoregression
Ilpo Viertola, Vladimir Iashin, Esa Rahtu
2024-09-23

Summary
This paper presents V-AURA, a new model designed to generate audio that closely matches video content. It pairs a high-framerate visual encoder with autoregressive audio generation so that the sounds it produces are well-timed and relevant to the visuals in the video.
What's the problem?
Creating audio that fits perfectly with video is challenging because it requires precise timing and relevance between what is seen and what is heard. Most existing methods struggle to maintain this alignment, leading to audio that feels disconnected from the visuals, which can ruin the viewer's experience.
What's the solution?
To tackle this issue, the researchers developed V-AURA, an autoregressive model that generates audio in sync with video. It uses a high-framerate visual feature extractor to capture fine-grained motion in the video and a cross-modal fusion strategy to combine the visual features with the audio being generated (a minimal sketch of this idea appears below). They also curated a new dataset called VisualSound, built from VGGSound by removing clips whose sounds are not clearly tied to what is visible, giving the model a cleaner training signal. With these components, V-AURA aligns audio with video more accurately than existing models.
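The fusion idea can be illustrated with a small PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names, the additive fusion, and the linear resampling of visual features to the audio-token rate are choices made here for clarity.

```python
# Hypothetical sketch of cross-modal fusion for autoregressive audio-token
# prediction. Names and design details are illustrative, not from V-AURA.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedAudioVisualDecoder(nn.Module):
    def __init__(self, num_audio_tokens=1024, d_model=512, visual_dim=768,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.audio_emb = nn.Embedding(num_audio_tokens, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_audio_tokens)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (B, T_a) discrete audio tokens generated so far
        # visual_feats: (B, T_v, visual_dim) high-framerate visual features
        B, T_a = audio_tokens.shape
        a = self.audio_emb(audio_tokens)                        # (B, T_a, d)
        # Resample visual features to the audio-token rate so each audio
        # step sees the temporally matching visual context.
        v = self.visual_proj(visual_feats).transpose(1, 2)      # (B, d, T_v)
        v = F.interpolate(v, size=T_a, mode="linear",
                          align_corners=False).transpose(1, 2)  # (B, T_a, d)
        x = a + v                                                # additive fusion
        # Causal mask keeps generation autoregressive.
        mask = torch.triu(torch.full((T_a, T_a), float("-inf")), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.head(h)                                      # next-token logits
```

Resampling the visual stream to the audio-token rate is one simple way to give every generated audio token a temporally matching visual context, which is the property the paper emphasizes.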
Why it matters?
This research is important because it enhances how we create multimedia content by ensuring that audio and video work together seamlessly. This can significantly improve the quality of films, games, and online videos, making them more engaging and enjoyable for viewers.
Abstract
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site
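On the dataset side, the abstract only states that clips whose auditory events are not aligned with the visual ones are removed during curation. A hedged sketch of such a filtering step is shown below; the scoring function `av_relevance_score` and the threshold are hypothetical stand-ins, not the authors' actual pipeline.

```python
# Hypothetical curation sketch: keep only clips whose audio events are judged
# to match the visuals. The scorer and threshold are assumptions for clarity.
def curate_visualsound(clips, av_relevance_score, threshold=0.5):
    """Filter (video_path, audio_path) pairs by an audio-visual relevance score.

    clips: iterable of (video_path, audio_path) pairs
    av_relevance_score: callable returning a scalar relevance/alignment score
    """
    kept = []
    for video_path, audio_path in clips:
        if av_relevance_score(video_path, audio_path) >= threshold:
            kept.append((video_path, audio_path))
    return kept
```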