Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition

Yi-Cheng Lin, Yu-Hsuan Li Liang, Hsuan Su, Tzu-Quan Lin, Shang-Tse Chen, Yun-Nung Chen, Hung-yi Lee

2025-10-13

Summary

This paper focuses on improving the accuracy of automatic speech recognition (ASR) systems when they encounter accents or conditions different from their training data, a problem known as 'domain shift'.

What's the problem?

ASR systems often struggle with new accents or environments because collecting enough labeled speech data for every possibility is expensive and time-consuming. A common workaround, called 'pseudo-labeling', uses a model's own transcriptions as training labels, but this often introduces systematic errors specific to certain accents, and simply filtering out bad labels doesn't fully solve the issue. The core problem is how to fix these recurring biases in the automatically labeled data without access to correct, human-verified labels for the new accents.

What's the solution?

The researchers correct these biases by comparing two ASR models. Both models start from the same initialization and are fine-tuned on the same source-domain speech, but one is trained on real, verified transcripts and the other on pseudo-labels. The difference between their parameters forms a 'correction vector' that captures the biases introduced by the pseudo-labeling process. Adding this vector to a target-domain model that was trained on pseudo-labels cancels out much of that bias and improves its accuracy. They tested the approach with the smallest variant of the Whisper ASR model (Whisper tiny) on speech from ten different African accents.
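The parameter-space correction described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the flat dict-of-weights representation, and the scaling factor `alpha` are assumptions for clarity (the summary does not mention a scaling knob; `alpha=1.0` adds the full vector).

```python
def correction_vector(real_weights, pseudo_weights):
    """Compute delta = theta_real - theta_pseudo, per parameter.

    Both source-domain models are fine-tuned from the same
    initialization; since they differ only in label quality, the
    weight difference captures the pseudo-label bias.
    """
    return {name: real_weights[name] - pseudo_weights[name]
            for name in real_weights}


def apply_correction(target_weights, delta, alpha=1.0):
    """Add the correction vector to a pseudo-labeled target model.

    alpha is a hypothetical scaling factor (an assumption, not from
    the paper summary); alpha=1.0 applies the full correction.
    """
    return {name: target_weights[name] + alpha * delta[name]
            for name in target_weights}


# Toy example with one scalar "parameter" per model; in practice the
# dicts would hold the full tensors of an ASR model such as Whisper.
delta = correction_vector({"w": 1.0}, {"w": 0.4})
corrected = apply_correction({"w": 2.0}, delta)
```

The same arithmetic applies unchanged to real model state dicts (e.g. PyTorch tensors), since subtraction and addition broadcast per parameter.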

Why it matters?

This work is important because it offers a practical way to improve speech recognition for a wider range of people and situations, especially in areas where collecting large amounts of labeled speech data is difficult. By reducing errors on accents that are often underrepresented in training data, it makes ASR technology more inclusive, achieving up to a 35% relative reduction in word error rate in their experiments.

Abstract

Robust ASR under domain shift is crucial because real-world systems encounter unseen accents and domains with limited labeled data. Although pseudo-labeling offers a practical workaround, it often introduces systematic, accent-specific errors that filtering fails to fix. We ask: How can we correct these recurring biases without target ground truth? We propose a simple parameter-space correction: in a source domain containing both real and pseudo-labeled data, two ASR models are fine-tuned from the same initialization, one on ground-truth labels and the other on pseudo-labels, and their weight difference forms a correction vector that captures pseudo-label biases. When applied to a pseudo-labeled target model, this vector enhances recognition, achieving up to a 35% relative Word Error Rate (WER) reduction on AfriSpeech-200 across ten African accents with the Whisper tiny model.