In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties
Nathan Roll, Calbert Graham, Yuka Tatsumi, Kim Tien Nguyen, Meghan Sumner, Dan Jurafsky
2025-05-22
Summary
This paper shows that in-context learning, where a model is given a few examples to learn from at inference time, helps speech recognition systems such as Phi-4 Multimodal get much better at understanding different speakers and language varieties, adapting in a way that resembles human listeners.
What's the problem?
Speech recognition systems often struggle to accurately transcribe speakers with accents or speaking styles that were underrepresented in their training data, which makes them less reliable in real-world use.
What's the solution?
The researchers show that giving the model just a few example utterances from a new speaker or language variety, with no retraining, lets it adapt quickly and improve its accuracy, producing a performance profile closer to that of a human listener who adjusts to new voices on the fly.
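In practice, this amounts to prepending a handful of (audio, transcript) pairs from the target speaker to the prompt before the utterance to be transcribed. The sketch below illustrates the idea in Python; the message layout, file names, and transcripts are illustrative assumptions, not the paper's or Phi-4 Multimodal's exact interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    audio_path: str   # path to an audio clip from the speaker
    transcript: str   # reference transcript (known for the context examples)

def build_icl_prompt(context: List[Utterance], target_audio: str) -> list:
    """Assemble a chat-style prompt: a few (audio, transcript) pairs from the
    new speaker, followed by the target utterance to transcribe.
    This message format is a generic sketch, not a specific model's API."""
    messages = []
    for example in context:
        messages.append({"role": "user",
                         "content": {"audio": example.audio_path,
                                     "text": "Transcribe this utterance."}})
        messages.append({"role": "assistant", "content": example.transcript})
    # The final turn contains only audio; the model must supply the transcript.
    messages.append({"role": "user",
                     "content": {"audio": target_audio,
                                 "text": "Transcribe this utterance."}})
    return messages

# Hypothetical example: three context utterances from an unfamiliar speaker,
# then a new clip from the same speaker to be transcribed.
context = [
    Utterance("speaker7_utt1.wav", "the bus arrives at half past nine"),
    Utterance("speaker7_utt2.wav", "please leave the parcel by the door"),
    Utterance("speaker7_utt3.wav", "we walked along the river after lunch"),
]
prompt = build_icl_prompt(context, "speaker7_utt4.wav")
# `prompt` would then be passed to a speech-capable model's chat interface;
# the exact inference call depends on the model being used.
```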
Why it matters?
This matters because speech recognition technology can become more flexible and dependable for everyone, regardless of how they speak, making voice assistants and transcription services more useful and more equitable.
Abstract
With only a small number of example utterances, in-context learning in Phi-4 Multimodal yields significant improvements in automatic speech recognition robustness, showing a performance profile similar to that of human listeners.