From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition
Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng
2025-05-27
Summary
This paper presents a way to improve multilingual speech recognition systems using a technique called speech back-translation. The method generates realistic synthetic speech from written text and uses it to train the recognizer, helping it transcribe spoken language more accurately.
What's the problem?
Speech recognition systems need large amounts of real transcribed speech to learn from, but collecting and transcribing that data for many languages takes enormous time and effort. Without enough training data, these systems make more mistakes, especially in low-resource languages.
What's the solution?
The authors use speech back-translation: they take large collections of written text and run them through text-to-speech models to produce high-quality synthetic speech. This synthetic speech, paired with the text it was generated from, is then used to train the speech recognition system, substantially reducing transcription errors.
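The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: `synthesize` here is a hypothetical stand-in for a real text-to-speech model, and the waveform it returns is fake.

```python
# Minimal sketch of speech back-translation: unlabeled text is converted
# into (audio, transcript) pairs that can augment an ASR training set.
# NOTE: `synthesize` is a placeholder; a real pipeline would call a TTS model.
import math

def synthesize(text: str, sample_rate: int = 16000) -> list:
    """Placeholder TTS: returns a dummy waveform whose length scales with
    the text length (a real TTS system would return synthesized audio)."""
    n_samples = int(0.08 * sample_rate * len(text))  # assume ~80 ms per character
    return [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(n_samples)]

def back_translate(text_corpus: list) -> list:
    """Turn each text line into an (audio, transcript) training pair."""
    return [(synthesize(line), line) for line in text_corpus]

corpus = ["hello world", "speech back-translation scales training data"]
synthetic_pairs = back_translate(corpus)
```

The key point is that the transcript for each synthetic utterance is known by construction, so the generated pairs can be mixed directly into supervised ASR training without any manual labeling.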
Why does it matter?
This matters because it makes accurate speech recognition available for many more languages, including those with little real-world data. That, in turn, enables better voice assistants, transcription services, and language tools for people around the world.
Abstract
Speech Back-Translation enhances multilingual ASR systems by generating high-quality synthetic speech from text corpora, significantly reducing transcription errors.