Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
Neta Glazer, Lenny Aharon, Ethan Fetaya
2026-03-11
Summary
This paper investigates why large AI models that handle both text and other types of data, like audio, sometimes ignore the non-text information even when it's important, and proposes a way to fix this without retraining the model.
What's the problem?
Large language models that work with multiple types of data, specifically audio and text, often rely too much on the text part of the input. This means they might miss crucial information present in the audio, leading to incorrect predictions. Essentially, the model doesn't 'listen' effectively, even when the audio provides a clear answer.
What's the solution?
Researchers used a technique called 'mechanistic interpretability' to pinpoint specific parts of the model – certain 'attention heads' – that specialize in processing audio. They found that when these heads attend strongly to the audio, the model's output genuinely depends on the audio evidence, so this attention acts as a 'listening' signal. Leveraging this, they subtly boost the model's internal activations along an audio-related direction during inference, making it pay more attention to the audio. Because this boost is applied *after* the model is already trained, no retraining or parameter updates are needed.
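To make the head-localization idea concrete, here is a minimal toy sketch of one plausible way to score attention heads by how much attention mass they place on the audio span of the input, and pick the top scorers as "audio specialists." All shapes, the random attention maps, and the top-k cutoff are illustrative assumptions, not the paper's actual procedure or numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 4 layers x 8 heads, attention over 20 input tokens,
# of which the first 12 are audio tokens and the rest are text tokens.
n_layers, n_heads, n_tokens, n_audio = 4, 8, 20, 12
attn = rng.random((n_layers, n_heads, n_tokens))
attn /= attn.sum(axis=-1, keepdims=True)  # each head's row is a distribution

# "Listening" score per head: total attention mass placed on the audio span.
audio_mass = attn[:, :, :n_audio].sum(axis=-1)  # shape (n_layers, n_heads)

# Pick the k highest-scoring heads as audio specialists (k is an assumption).
k = 5
top_flat = np.argsort(audio_mass.ravel())[-k:]
specialists = [divmod(i, n_heads) for i in top_flat]  # (layer, head) pairs
print(sorted(specialists))
```

In a real LALM the attention maps would come from a forward pass over audio-text prompts, and the scores would be averaged over a dataset rather than computed from one synthetic example.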
Why it matters?
This work is important because it shows us how to improve the reliability of AI models that handle multiple types of data. By making these models better at integrating information from different sources, like audio and text, we can build more accurate and trustworthy AI systems. The fact that this improvement doesn't require retraining the model is a big deal, as retraining can be very expensive and time-consuming.
Abstract
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it contains important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio–silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the model's audio effect. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
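The audio–silence steering idea from the abstract can be sketched in a few lines: take the final representation of an audio-bearing input, subtract the representation of the same prompt with silence in its place, and nudge the model's hidden state along that direction at inference time. The hidden size, the random toy vectors, and the scale `alpha` are all illustrative assumptions; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy hidden size

# Hypothetical final-layer representations: one from an audio-bearing input,
# one from the same prompt with silence substituted for the audio.
h_audio = rng.normal(size=d)
h_silence = rng.normal(size=d)

# Steering direction: audio minus silence, normalized to unit length.
direction = h_audio - h_silence
direction /= np.linalg.norm(direction)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Inference-time activation intervention: shift the final representation
    along the audio-minus-silence direction by an assumed scale alpha."""
    return hidden + alpha * direction

h_steered = steer(h_audio, direction)
# The steered state has a larger projection onto the audio direction,
# i.e. the audio's effect on the representation is amplified.
print(direction @ h_steered - direction @ h_audio)
```

No parameters change here: the intervention only edits an activation during the forward pass, which is why the method needs no retraining.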