
AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

2026-02-09


Summary

This research explores how Sparse Autoencoders (SAEs), a type of neural network used to interpret other models, can help us understand what is happening inside complex audio-processing models like Whisper and HuBERT, which are widely used for speech recognition and speech understanding.

What's the problem?

While Sparse Autoencoders have proven useful for explaining how other AI models work, they have rarely been applied to audio. As a result, we don't fully understand how audio models 'think' or which features they focus on when processing sound, which makes it hard to interpret what these models are learning or to fix issues such as misinterpreted sounds.

What's the solution?

The researchers trained Sparse Autoencoders on the internal activations of every encoder layer of Whisper and HuBERT. They then measured how consistent the learned features were across training runs with different random seeds, and how faithfully the autoencoders could reconstruct the original layer activations. They also checked whether individual features pick out specific sounds, such as laughter, whispering, or background noise, and whether the learned features correlate with human brain activity (EEG) recorded while people listen to speech. Finally, they used feature steering to reduce Whisper's false detections of speech; a minimal sketch of the training setup is given below, and a sketch of the steering step follows the abstract.
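To make the basic idea concrete, here is a minimal sketch (not the authors' code) of training a sparse autoencoder on hidden states collected from one Whisper encoder layer. The dimensions, the choice of layer, and the L1 sparsity penalty weight are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: a sparse autoencoder trained on activations from one
# audio-model encoder layer. Hyperparameters below are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> sparse feature codes
        self.decoder = nn.Linear(d_features, d_model)  # sparse feature codes -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))         # non-negative, mostly-zero codes
        reconstruction = self.decoder(features)
        return reconstruction, features

# Hypothetical setup: 512-dim encoder states with an 8x feature expansion.
sae = SparseAutoencoder(d_model=512, d_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_weight = 1e-3  # trades reconstruction quality against sparsity

def training_step(activations: torch.Tensor) -> float:
    """activations: (batch, d_model) hidden states gathered from one encoder layer."""
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_weight * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key property is that each input activation is explained by only a handful of active features, which is what makes the individual features inspectable (e.g. as "laughter" or "background noise" detectors).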

Why it matters?

This work is important because it gives us a way to peek inside 'black box' audio models and understand what they're learning. This understanding can help us improve these models, make them more reliable, and even build AI that processes audio more like humans do. The ability to reduce false speech detections and the correlation with human brain activity suggest a path towards more natural and accurate AI audio processing.

Abstract

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering) and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
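The feature-steering result can also be sketched in code. The snippet below (an assumption-laden illustration, reusing the SparseAutoencoder sketch above, not the paper's implementation) decodes one activation into SAE features, rescales a chosen set of feature indices, and returns the edited reconstruction. The feature indices and the idea of ablating noise-related features to suppress false speech detections are hypothetical placeholders for whatever features the paper's analysis actually identifies.

```python
# Minimal sketch of feature steering on one encoder-layer activation.
import torch

@torch.no_grad()
def steer_activation(sae, activation: torch.Tensor,
                     feature_ids: list[int], scale: float = 0.0) -> torch.Tensor:
    """activation: (batch, d_model) hidden state from one encoder layer.
    feature_ids: indices of SAE features to rescale (hypothetical choice).
    scale=0.0 ablates the features; scale>1.0 would amplify them instead."""
    _, features = sae(activation)          # sparse feature codes for this activation
    features[:, feature_ids] *= scale      # suppress (or boost) the selected features
    return sae.decoder(features)           # edited activation to feed back into the model

# In practice this would run inside a forward hook on the chosen encoder layer,
# replacing that layer's output with the steered reconstruction before the
# rest of the model (e.g. Whisper's decoder) sees it.
```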