Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
2025-10-15
Summary
This paper focuses on improving how computers understand the relationship between audio and text, specifically when using powerful language models. It introduces a new technique called Diffusion-Link to better connect information from audio and text sources.
What's the problem?
Currently, even though we have good systems for processing audio and text separately, there's a disconnect when trying to combine them, especially when linking them to large language models. This 'modality gap' means the audio and text information aren't understood in the same way, limiting how well these systems can perform tasks like describing sounds. Essentially, the computer struggles to translate audio 'thoughts' into text 'thoughts'.
What's the solution?
The researchers developed Diffusion-Link, which acts like a translator between audio and text. It takes the audio representation and gradually transforms it, using a process inspired by the diffusion models behind image generation, so that it more closely resembles how text is represented. It's a relatively small addition to existing systems, consisting of just a few processing layers, and it's applied *after* the initial audio and text encoders but *before* the information is fed to the language model. This bridges the gap between the two types of data.
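To make the idea concrete, here is a minimal sketch of what such a bridge could look like: a lightweight denoiser built from three residual MLP blocks that iteratively maps Gaussian noise, conditioned on a frozen audio embedding, toward a text-like embedding. The dimensions, noise schedule, and conditioning-by-concatenation scheme are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN, STEPS = 64, 128, 20          # hypothetical sizes and step count

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def init_params():
    # input projection over [noisy latent; audio condition; timestep scalar]
    w_in = rng.normal(0, 0.02, (2 * DIM + 1, DIM))
    blocks = [(rng.normal(0, 0.02, (DIM, HIDDEN)),
               rng.normal(0, 0.02, (HIDDEN, DIM))) for _ in range(3)]
    w_out = rng.normal(0, 0.02, (DIM, DIM))
    return w_in, blocks, w_out

def denoiser(params, z_t, audio, t):
    """Predict the noise in z_t, conditioned on the audio embedding and step t."""
    w_in, blocks, w_out = params
    h = np.concatenate([z_t, audio, [t / STEPS]]) @ w_in
    for w1, w2 in blocks:                  # three residual MLP blocks
        h = h + gelu(h @ w1) @ w2
    return h @ w_out

# Standard DDPM ancestral sampling (epsilon-prediction parameterisation).
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

params = init_params()                     # untrained weights: structure only
audio_emb = rng.normal(size=DIM)           # stands in for a frozen audio embedding
z = rng.normal(size=DIM)                   # start from pure Gaussian noise
for t in reversed(range(STEPS)):
    eps = denoiser(params, z, audio_emb, t)
    z = (z - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        z += np.sqrt(betas[t]) * rng.normal(size=DIM)
# z now plays the role of a "text-like" embedding handed to the language model
```

With trained weights, the sampled `z` would land inside the text-embedding distribution, which is what lets the language model consume audio inputs as if they were text.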
Why does it matter?
This work is important because it shows that actively reducing the gap between audio and text understanding is crucial for building better multimodal AI systems. By improving this connection, they achieved significant improvements in automatically generating captions for audio, even surpassing previous state-of-the-art results without needing extra information. It suggests a new direction for research, moving beyond simply retrieving relevant information and towards a more fundamental understanding of how to combine different types of data.
Abstract
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance: https://github.com/DevKiHyun/Diffusion-Link
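The "geometric criteria" for the modality gap can be illustrated with a common formulation: the Euclidean distance between the centroids of the L2-normalised audio and text embedding sets. The abstract does not specify the paper's exact metrics, so this toy sketch (with a deliberate synthetic offset standing in for the gap) is an assumption about one reasonable way to measure it.

```python
import numpy as np

rng = np.random.default_rng(1)

def modality_gap(audio_emb, text_emb):
    """Distance between centroids of L2-normalised embedding sets — one common
    geometric 'modality gap' measure; the paper's exact criteria may differ."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.linalg.norm(a.mean(axis=0) - t.mean(axis=0))

# Toy data: the audio cluster sits at a constant offset from the text cluster.
text = rng.normal(size=(100, 32))
audio = rng.normal(size=(100, 32)) + 2.0   # synthetic offset => visible gap
bridged = audio - 2.0                       # an ideal "bridge" removes the offset

gap_before = modality_gap(audio, text)
gap_after = modality_gap(bridged, text)
print(gap_before, gap_after)               # the gap shrinks after bridging
```

A "collective migration of audio embeddings toward the text distribution", as the abstract describes, corresponds to this centroid distance shrinking after the bridging module is applied.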