A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

Cheng Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan Po Huang, Hung-Yi Lee

2025-06-16

Summary

This paper presents a self-refining framework that improves automatic speech recognition (ASR) systems using unlabeled speech data. An existing ASR model first produces rough transcriptions (pseudo-labels) for speech that has no transcripts. These pseudo-labeled pairs are then used to train a text-to-speech (TTS) system, which generates synthetic speech. The synthetic speech and its text are fed back to train the ASR model again, closing a loop in which the system improves itself.

What's the problem?

Many speech datasets have no labeled transcripts, which makes it hard to train or improve ASR systems. Without large amounts of labeled data, ASR models perform poorly on specific languages and in difficult settings such as mixed-language speech. Existing methods also struggle to turn unlabeled data into performance gains.

What's the solution?

The authors build a self-improving loop. An initial ASR model generates pseudo-labels for unlabeled speech. These labels are used to train a high-quality TTS system that produces synthetic audio from text. The synthetic data is then combined with the real unlabeled data to retrain the ASR model, making it more specialized and accurate. The loop allows continued improvement without any manually labeled data.
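
The loop described above can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: the names `SelfRefineStages`, `self_refine`, `train_tts`, `train_asr`, `text_corpus`, and `rounds` are all hypothetical, the training steps are passed in as plain callables, and the sketch assumes that the real speech enters retraining paired with its pseudo-labels and that the text to synthesize comes from some separate text collection.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Audio = List[float]        # placeholder type standing in for a waveform
Pair = Tuple[Audio, str]   # (speech, transcript)


@dataclass
class SelfRefineStages:
    """Hypothetical hooks for the three stages; names are not from the paper."""
    transcribe: Callable[[Audio], str]                          # initial ASR model
    train_tts: Callable[[List[Pair]], Callable[[str], Audio]]   # returns a synthesizer
    train_asr: Callable[[List[Pair]], Callable[[Audio], str]]   # returns a new ASR model


def self_refine(
    stages: SelfRefineStages,
    unlabeled_speech: List[Audio],
    text_corpus: List[str],
    rounds: int = 1,
) -> Callable[[Audio], str]:
    """Run the pseudo-label -> TTS -> synthesize -> retrain cycle `rounds` times."""
    transcribe = stages.transcribe
    for _ in range(rounds):
        # 1. Pseudo-label the unlabeled speech with the current ASR model.
        pseudo_pairs = [(audio, transcribe(audio)) for audio in unlabeled_speech]
        # 2. Train a TTS system on the pseudo-labeled pairs.
        synthesize = stages.train_tts(pseudo_pairs)
        # 3. Synthesize speech for text, giving pairs whose transcript is exact.
        synthetic_pairs = [(synthesize(text), text) for text in text_corpus]
        # 4. Retrain ASR on synthetic pairs plus the pseudo-labeled real speech
        #    (assumption: this is how the real data is combined with synthetic data).
        transcribe = stages.train_asr(synthetic_pairs + pseudo_pairs)
    return transcribe
```

In practice each hook would wrap a real training routine. The point of the sketch is only the data flow: the synthetic pairs carry exact transcripts because the text came first, while the real speech carries only the labels that the current ASR model guessed.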

Why does it matter?

This approach helps build better speech recognition systems from large amounts of unlabeled data, which is easier to obtain than labeled data. It makes it practical to improve ASR models for languages or domains where labeled data is scarce, and it can greatly reduce errors, especially in challenging cases such as Mandarin or mixed-language speech. Ultimately, it can make voice technology more accurate and accessible worldwide.

Abstract

A self-refining framework enhances ASR performance using unlabeled datasets by integrating pseudo-labeling, TTS, and synthesized speech to create a specialized model.