SoundReactor: Frame-level Online Video-to-Audio Generation
Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
2025-10-06
Summary
This paper introduces a new way to create audio from video in real time: it generates sound as each video frame arrives, rather than needing the whole video at once.
What's the problem?
Current methods for turning video into audio require having the entire video available beforehand. This makes them unusable for things like live streaming, video games, or any application where you need sound *immediately* as the video plays. Imagine trying to create sound effects for a live video game stream – you can’t wait for the whole game to be recorded!
What's the solution?
The researchers developed a model called SoundReactor that generates audio frame by frame. A vision encoder (the smallest variant of DINOv2) summarizes each incoming video frame into a single token, and a decoder-only causal transformer builds the audio piece by piece from those tokens, keeping the sound synchronized with the video without ever looking at future frames. The model is first pre-trained with a diffusion objective and then fine-tuned with a consistency objective, which lets its diffusion head decode each audio frame in only a few steps and keeps per-frame latency low.
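To make the frame-by-frame idea concrete, here is a minimal sketch of the online generation loop. The stand-in functions (`encode_frame`, `causal_step`) are toy placeholders for the paper's DINOv2 encoder and causal transformer; the dimensions and names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # token / latent dimension (illustrative)
N_PATCHES = 4   # patch (grid) features per frame (illustrative)

def encode_frame(frame_patches):
    """Aggregate a frame's patch (grid) features into a single token
    by mean pooling -- one token per frame preserves causality."""
    return frame_patches.mean(axis=0)

def causal_step(context_tokens):
    """Toy stand-in for one decoder step: predict the next audio
    latent from the causal context (past video tokens and latents)."""
    return np.tanh(context_tokens).mean(axis=0)

def generate_online(video_patch_stream):
    context, audio_latents = [], []
    for frame_patches in video_patch_stream:       # frames arrive one at a time
        context.append(encode_frame(frame_patches))  # one token per frame
        latent = causal_step(np.stack(context))      # sees only frames <= t
        audio_latents.append(latent)
        context.append(latent)   # autoregressive: the latent joins the context
    return np.stack(audio_latents)

# Five frames of fake patch features, consumed online.
stream = [rng.normal(size=(N_PATCHES, D)) for _ in range(5)]
latents = generate_online(stream)
print(latents.shape)  # (5, 16)
```

The key property the loop demonstrates is end-to-end causality: the latent for frame t is computed before any later frame exists, so running the loop on a prefix of the stream reproduces exactly the same outputs.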
Why it matters?
This work is important because it opens the door to creating interactive and dynamic audio experiences. It allows for real-time sound generation for live content, virtual worlds, and potentially even assistive technologies. The low delay achieved by SoundReactor means the audio feels truly connected to the video, making the experience much more immersive and realistic.
Abstract
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through a diffusion pre-training followed by consistency fine-tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
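The latency figures in the abstract depend on how many network evaluations (NFE) the diffusion head needs per frame; consistency fine-tuning is what brings this down to 1-4 steps. The following is a generic sketch of multistep consistency sampling, where `consistency_fn` is a toy stand-in for a trained consistency model and the noise schedule is a hypothetical choice, not the paper's.

```python
import numpy as np

def consistency_fn(x, sigma):
    # Toy stand-in for the trained consistency model: shrinks the
    # noisy input toward the data more strongly at high noise levels.
    return x / (1.0 + sigma)

def sample(shape, sigmas, rng):
    """Multistep consistency sampling: denoise in one network call,
    then re-inject noise at the next (lower) level. The number of
    function evaluations (NFE) equals len(sigmas)."""
    x = rng.normal(size=shape) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        x = consistency_fn(x, sigma)          # one network evaluation
        if i + 1 < len(sigmas):               # re-noise for the next step
            x = x + rng.normal(size=shape) * sigmas[i + 1]
    return x

rng = np.random.default_rng(1)
x_nfe1 = sample((8,), [1.0], rng)                   # NFE = 1
x_nfe4 = sample((8,), [1.0, 0.5, 0.25, 0.1], rng)   # NFE = 4
```

Fewer steps mean lower per-frame latency at some cost in sample quality, which matches the abstract's trade-off between the NFE=1 (26.3 ms) and NFE=4 (31.5 ms) settings.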