From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition
Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng
2025-05-27
Summary
This paper presents a way to improve multilingual speech recognition systems using a technique called speech back-translation. The method generates realistic synthetic speech from written text and uses it to train the recognizer, helping it transcribe spoken language more accurately.
What's the problem?
Speech recognition systems need large amounts of real transcribed speech to learn from, but collecting and transcribing that data for many languages takes enormous time and effort. Without enough training data, these systems make more mistakes, especially in low-resource languages.
What's the solution?
The authors use speech back-translation: they take large collections of written text and run them through text-to-speech models to produce high-quality synthetic speech. This synthetic speech, paired with the text it was generated from, is then used to train the speech recognition system, substantially reducing transcription errors.
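The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: `synthesize` here is a hypothetical stand-in for a real text-to-speech model, and the waveform it returns is fake.

```python
# Minimal sketch of speech back-translation: unlabeled text is converted
# into (audio, transcript) pairs that can augment an ASR training set.
# NOTE: `synthesize` is a placeholder; a real pipeline would call a TTS model.
import math

def synthesize(text: str, sample_rate: int = 16000) -> list:
    """Placeholder TTS: returns a dummy waveform whose length scales with
    the text length (a real TTS system would return synthesized audio)."""
    n_samples = int(0.08 * sample_rate * len(text))  # assume ~80 ms per character
    return [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(n_samples)]

def back_translate(text_corpus: list) -> list:
    """Turn each text line into an (audio, transcript) training pair."""
    return [(synthesize(line), line) for line in text_corpus]

corpus = ["hello world", "speech back-translation scales training data"]
synthetic_pairs = back_translate(corpus)
```

The key point is that the transcript for each synthetic utterance is known by construction, so the generated pairs can be mixed directly into supervised ASR training without any manual labeling.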
Why does it matter?
This matters because it makes accurate speech recognition available for many more languages, including those with little real-world data. That, in turn, enables better voice assistants, transcription services, and language tools for people around the world.
Abstract
Speech Back-Translation enhances multilingual ASR systems by generating high-quality synthetic speech from text corpora, significantly reducing transcription errors.