USAD: Universal Speech and Audio Representation via Distillation
Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu
2025-06-25
Summary
This paper introduces USAD, a method that trains a single AI model to understand diverse audio types, including speech, music, and environmental sounds, by learning from several specialized models.
What's the problem?
Most existing audio models specialize in a single domain, such as speech or music, which makes it hard to build one model that works well across all audio types.
What's the solution?
The researchers used a technique called layer-to-layer distillation: a single student model is trained to reproduce the internal features of two expert teacher models, one for speech and one for general audio. To keep training efficient, the student matches only a subset of the teachers' layers rather than every layer.
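The layer-matching idea can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact recipe: the specific layer pairs, loss weighting, and feature shapes here are assumptions for demonstration.

```python
import numpy as np

def layer_to_layer_distill_loss(student_feats, teacher_feats, layer_pairs):
    """Mean-squared error between selected student and teacher layers.

    student_feats, teacher_feats: lists of (frames, dim) feature arrays,
    one per transformer layer. layer_pairs: (student_idx, teacher_idx)
    tuples naming which layers to match (a sparse subset, for efficiency).
    """
    total = 0.0
    for s_idx, t_idx in layer_pairs:
        diff = student_feats[s_idx] - teacher_feats[t_idx]
        total += np.mean(diff ** 2)
    return total / len(layer_pairs)

# Toy features: 12-layer student and two 12-layer teachers (hypothetical sizes).
rng = np.random.default_rng(0)
frames, dim = 50, 8
student       = [rng.normal(size=(frames, dim)) for _ in range(12)]
speech_teacher = [rng.normal(size=(frames, dim)) for _ in range(12)]
audio_teacher  = [rng.normal(size=(frames, dim)) for _ in range(12)]

# Match only every fourth layer instead of all twelve (assumed choice).
pairs = [(3, 3), (7, 7), (11, 11)]
loss = (layer_to_layer_distill_loss(student, speech_teacher, pairs)
        + layer_to_layer_distill_loss(student, audio_teacher, pairs))
print(loss)
```

In practice the student would minimize this combined loss with gradient descent, pulling its intermediate representations toward both teachers at once; matching only a few layers keeps the extra compute small.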
Why does it matter?
A single model that understands all kinds of audio makes it easier and faster to build applications in speech recognition, music analysis, and sound detection, improving performance across many tasks with just one system.
Abstract
USAD integrates diverse audio types using efficient layer-to-layer distillation from domain-specific models, achieving competitive performance across various benchmarks with a single encoder.