Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang

2025-12-03

Summary

This research explores whether training a system to generate both audio and video together can actually improve the quality of the video itself, even if you don't care about the audio at all.

What's the problem?

Many current generative systems produce audio and video jointly, and it is usually assumed that this mainly helps keep the two modalities synchronized. However, it wasn't clear whether joint training actually makes the *video* better on its own, beyond making the sounds match what's happening on screen. The question is: does adding audio information during training help the system understand and generate more realistic video?

What's the solution?

The researchers built a new system called AVFullDiT that combines pre-trained text-to-video and text-to-audio modules. They then trained two versions: one that denoises audio and video together, and another that generates *only* video. Both were trained under identical settings, with the same data and starting point, so any difference in video quality could be attributed to the audio-video joint training.
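The controlled comparison above can be sketched as a single training objective with an audio term that can be switched off. This is an illustrative sketch only, not the AVFullDiT implementation: the `predict_fn` placeholder, the latent shapes, and the `w_audio` weight are all assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, noise, alpha_bar):
    # Standard diffusion forward process: x_t = sqrt(a)*x0 + sqrt(1-a)*eps
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

def joint_denoising_loss(video_lat, audio_lat, predict_fn, alpha_bar,
                         w_audio=1.0):
    """Combined denoising objective over both modalities.

    `predict_fn` is a hypothetical stand-in for a joint transformer that
    sees both noisy latents and returns a noise estimate for each.
    Setting w_audio=0 recovers the video-only (T2V) baseline.
    """
    eps_v = rng.standard_normal(video_lat.shape)
    eps_a = rng.standard_normal(audio_lat.shape)
    xt_v = add_noise(video_lat, eps_v, alpha_bar)
    xt_a = add_noise(audio_lat, eps_a, alpha_bar)
    pred_v, pred_a = predict_fn(xt_v, xt_a)
    loss_v = np.mean((pred_v - eps_v) ** 2)  # video denoising error
    loss_a = np.mean((pred_a - eps_a) ** 2)  # audio denoising error
    return loss_v + w_audio * loss_a

# Toy "model" that just echoes its inputs as the noise estimate.
identity_model = lambda v, a: (v, a)
video = rng.standard_normal((4, 8))   # stand-in for video latents
audio = rng.standard_normal((4, 2))   # stand-in for audio latents
loss = joint_denoising_loss(video, audio, identity_model, alpha_bar=0.5)
```

The key point of the experiment is that the joint model and the video-only model share everything except the audio term, so comparing them isolates what the extra audio supervision contributes to video quality.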

Why it matters?

The results showed that training with audio *did* improve the video, especially in scenes with complex movements like objects colliding. The idea is that predicting the sound forces the system to learn how things in the real world cause sounds, which then helps it create more realistic and physically accurate video. This suggests that training systems with multiple types of information (like audio and video) could lead to better AI that understands the world around us more deeply.

Abstract

Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision times impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.