
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs

Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic

2025-10-28


Summary

This research investigates how large language models, which are getting really good at understanding speech from both audio and video, actually *work* internally when they are fine-tuned for speech recognition tasks. It focuses on identifying unusual patterns in how the model pays attention to different parts of the input.

What's the problem?

While these language models are improving speech recognition, we don't fully understand what's happening 'under the hood' during fine-tuning. The researchers noticed that in regular language tasks, certain 'sink' tokens (specific positions in the input) grab a disproportionate share of the model's attention and develop unusually large internal activations. They wanted to see if the same thing happens when the models process speech from audio, video, or both, and whether it causes problems.
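The 'attention grabbing' behavior can be made concrete with a small sketch: a sink is a token that, averaged over attention heads and query positions, receives an outsized share of attention mass. The threshold, shapes, and toy data below are illustrative, not taken from the paper:

```python
import torch

def find_attention_sinks(attn, threshold=0.3):
    """Flag tokens that absorb a disproportionate share of attention.

    attn: (num_heads, seq_len, seq_len) attention weights, rows sum to 1.
    A token counts as a 'sink' if, averaged over heads and query
    positions, it receives more than `threshold` of the attention mass.
    (The 0.3 threshold is illustrative, not from the paper.)
    """
    # Average over heads, then over query rows -> attention received per key.
    received = attn.mean(dim=0).mean(dim=0)  # (seq_len,)
    return (received > threshold).nonzero(as_tuple=True)[0].tolist()

# Toy example: 2 heads, 5 tokens, with token 0 (BOS-like) dominating.
torch.manual_seed(0)
logits = torch.randn(2, 5, 5)
logits[:, :, 0] += 4.0          # bias every query's attention toward token 0
attn = logits.softmax(dim=-1)
print(find_attention_sinks(attn))
```

In a real analysis, `attn` would come from a forward pass with attention weights returned, and this check would be run per layer to see where sinks emerge.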

What's the solution?

The researchers found these 'sink' tokens *do* appear in speech recognition models, not just at the beginning-of-sentence (BOS) token but also at low-meaning tokens in the middle of the sequence. They traced the associated massive activations to the model's MLP layers, where they show up at the same fixed feature indices for every sink token, and found that intermediate sink tokens have hidden states highly similar (in cosine similarity) to the BOS token, which amplifies their attention and activation. To fix this, they added a simple 'decorrelation' loss during training that discourages intermediate tokens from being so similar to the BOS token, essentially spreading out the model's attention.
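A minimal sketch of what such a decorrelation penalty could look like, assuming access to a layer's hidden states; the paper's exact formulation, layer choice, and loss weighting may differ:

```python
import torch
import torch.nn.functional as F

def bos_decorrelation_loss(hidden):
    """Penalize cosine similarity between the BOS token and later tokens.

    hidden: (batch, seq_len, dim) hidden states from some LLM layer.
    This is a sketch of the general idea (a similarity penalty added to
    the training objective), not the paper's exact loss.
    """
    bos = hidden[:, :1, :]                        # (batch, 1, dim)
    rest = hidden[:, 1:, :]                       # (batch, seq_len-1, dim)
    cos = F.cosine_similarity(rest, bos, dim=-1)  # broadcasts over tokens
    return cos.pow(2).mean()                      # push similarity toward 0

# Usage sketch: total = task_loss + lam * bos_decorrelation_loss(hidden)
h = torch.randn(2, 10, 64, requires_grad=True)
loss = bos_decorrelation_loss(h)
loss.backward()
```

Squaring the cosine similarity drives it toward zero from either sign, so the gradient pushes intermediate hidden states away from the BOS direction rather than toward its opposite.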

Why it matters?

This work is important because understanding these internal dynamics helps us build better speech recognition systems. By reducing these 'sink' tokens and their associated massive activations, the researchers improved the model's word error rate, especially when the audio and video features are heavily downsampled (i.e., compressed to save compute), making the technology more robust in real-world situations.

Abstract

Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
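The abstract's claim that massive activations sit at fixed feature indices across all sink tokens can be probed with a sketch like the following; the tensor shapes, planted values, and token positions are purely illustrative:

```python
import torch

def massive_activation_indices(hidden, k=3):
    """Return the k feature dimensions with the largest magnitudes per token.

    hidden: (seq_len, dim) hidden states at some layer. If massive
    activations occur at a fixed set of feature indices across sink
    tokens, those indices will recur in the rows returned here.
    """
    return hidden.abs().topk(k, dim=-1).indices  # (seq_len, k)

# Toy example: plant a huge value at feature 7 of two "sink" tokens.
torch.manual_seed(0)
h = torch.randn(5, 16)
h[[0, 3], 7] = 100.0   # tokens 0 and 3 act as sinks at feature index 7
top = massive_activation_indices(h, k=1)
print(top[0].item(), top[3].item())
```

Running this per layer on real hidden states, and comparing the recurring indices at sink positions, is one way to reproduce the kind of analysis the abstract describes.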