How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli

2025-11-07

Summary

This paper investigates how to automatically assess the quality of speech-to-text translation (ST) systems, which convert spoken language into written text in another language.

What's the problem?

Currently, ST systems are usually evaluated by comparing their output to human-created 'gold standard' reference translations. However, this method ignores the original spoken input, which contains valuable information. In text-based machine translation, recent neural metrics achieve stronger agreement with human judgments by also looking at the source text, but applying this idea to ST is difficult: the source is audio rather than text, and accurate human transcripts of that audio are often expensive or simply unavailable.

What's the solution?

The researchers explored ways to use textual stand-ins for the original audio when evaluating ST systems. They tested two complementary approaches: using automatic speech recognition (ASR) to transcribe the audio into text, and 'back-translating' the reference translation (translating it back into the source language). They also developed a two-step cross-lingual re-segmentation technique to align these synthetic source texts with the reference translations, which source-aware metrics need in order to work effectively. They tested these methods on two benchmarks covering 79 language pairs and six ST systems with different architectures and performance levels.
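
To make the basic recipe concrete, here is a minimal sketch of scoring an ST hypothesis with a source-aware neural MT metric, feeding it a synthetic source in place of the unavailable gold transcript. It assumes the unbabel-comet package; COMET, the model name, and the example sentences are illustrative choices, not necessarily the paper's exact metric or data.

```python
# Minimal sketch: score ST output with a source-aware MT metric,
# using a synthetic source (ASR transcript or back-translation)
# instead of a gold transcript. Requires `pip install unbabel-comet`;
# model name and sentences are illustrative.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # source-aware metric
model = load_from_checkpoint(model_path)

samples = [
    {
        "src": "wie spät ist es",  # synthetic source, e.g. an ASR transcript
        "mt": "what time is it",   # ST system hypothesis
        "ref": "what's the time",  # gold reference translation
    }
]

output = model.predict(samples, batch_size=8, gpus=0)  # gpus=0 -> CPU
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level score
```

The only change from standard source-aware MT evaluation is what goes into the "src" field: an ASR transcript of the audio or a back-translation of the reference, rather than a human-produced source text.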

Why it matters?

This work is important because it proposes more accurate ways to evaluate speech translation systems. By incorporating information from the original audio, even if it's an approximation, the evaluation becomes more reliable and better reflects how humans would judge the quality of the translation. This ultimately helps developers build better speech translation technology.

Abstract

Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio: automatic speech recognition (ASR) transcripts and back-translations of the reference translation. We also introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
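
As a rough illustration of the reliability finding above, here is a hypothetical sketch of a rule for choosing between the two synthetic sources: prefer ASR transcripts when the ASR system's word error rate, estimated on held-out data with gold transcripts, is below 20%, and fall back to back-translations otherwise. The helper names and the held-out estimation step are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical selection rule based on the paper's finding: ASR
# transcripts are the more reliable proxy when WER is below 20%,
# while back-translations are a cheaper fallback.

def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: Levenshtein distance over tokens,
    normalized by reference length."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(reference), 1)


def choose_synthetic_source(asr_transcript: str, back_translation: str,
                            estimated_wer: float,
                            threshold: float = 0.20) -> str:
    """Prefer the ASR transcript when the ASR system is accurate
    enough; otherwise use the back-translation of the reference."""
    return asr_transcript if estimated_wer < threshold else back_translation


# Example: German audio translated into English, so both synthetic
# sources are German. WER is estimated on a held-out utterance.
gold = "wie viel kostet eine fahrkarte nach berlin".split()
asr = "wie viel kosten eine fahrkarte nach berlin".split()
estimated_wer = wer(gold, asr)  # 1 substitution / 7 words ~ 0.14

src = choose_synthetic_source(
    "wie viel kosten eine fahrkarte nach berlin",  # ASR transcript
    "was kostet ein ticket nach berlin",           # back-translation
    estimated_wer,
)  # picks the ASR transcript, since 0.14 < 0.20
```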