Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle
2025-12-19
Summary
This research investigates whether new 'SpeechLLMs' – AI models that directly translate spoken language – are better at speech translation than the traditional method of first converting speech to text and *then* translating it.
What's the problem?
Currently, speech translation usually happens in two steps: first, a speech recognition system converts spoken words into text, and then a machine translation model translates that text. Recently, researchers have been developing SpeechLLMs that skip the intermediate transcript entirely, translating speech into another language in a *single step*. However, it wasn't clear whether these new SpeechLLMs actually perform better than the established two-step approach, especially in difficult real-world situations like noisy environments or with speakers who don't speak perfectly clearly.
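To make the distinction concrete, here is a minimal sketch of both approaches, assuming Hugging Face pipelines and a German audio file named talk.wav; the models named below are illustrative stand-ins, not the systems benchmarked in the paper.

```python
# Minimal sketch of cascade vs. direct speech translation (illustrative only;
# model names and the audio file are assumptions, not the paper's systems).
from transformers import pipeline

# Cascade, step 1: a speech recognition model turns German speech into German text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
transcript = asr("talk.wav")["text"]

# Cascade, step 2: a separate text-to-text model translates the transcript.
mt = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="deu_Latn",
    tgt_lang="eng_Latn",
)
cascade_translation = mt(transcript)[0]["translation_text"]

# Direct: a single model maps speech straight to target-language text with no
# intermediate transcript; Whisper's built-in speech-to-English "translate"
# task stands in here for a SpeechLLM.
direct_translation = asr("talk.wav", generate_kwargs={"task": "translate"})["text"]

print("cascade:", cascade_translation)
print("direct: ", direct_translation)
```

The cascade's weakness is that any transcription error in step 1 is passed on to step 2; the direct model avoids that, but gives up the mature text-translation component.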
What's the solution?
The researchers created a large, systematic test suite called 'Hearing to Translate'. They compared five state-of-the-art SpeechLLMs against sixteen strong baseline systems, some using the traditional two-step cascade and others coupling speech foundation models with multilingual LLMs in different ways. Every system was evaluated on a wide variety of speech samples – sixteen benchmarks, thirteen language pairs, and nine challenging conditions such as background noise, disfluent speech, and long recordings – to see how well each one held up.
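At its core, evaluating such a grid means translating every utterance with every system and scoring the output against reference translations. The toy sketch below shows that loop, with sacreBLEU's chrF standing in for the paper's actual metrics; all system and data names are hypothetical placeholders.

```python
# Toy sketch of scoring a systems x (benchmark, language pair, condition) grid.
# Assumes each system is a translate(audio_path) -> text function and each grid
# cell is a list of (audio_path, reference) pairs; chrF is a stand-in metric.
from sacrebleu.metrics import CHRF

chrf = CHRF()

def score_system(translate, pairs):
    """Translate every utterance in one grid cell, score against references."""
    hypotheses = [translate(audio) for audio, _ in pairs]
    references = [ref for _, ref in pairs]
    return chrf.corpus_score(hypotheses, [references]).score

# Dummy stand-ins so the sketch runs end to end; real systems would load
# models, and real cells would hold audio files and human references.
systems = {
    "cascade_A": lambda audio: "the cat sat on the mat",
    "speechllm_B": lambda audio: "a cat sits on the mat",
}
grid = {
    # (benchmark, language pair, condition) -> [(audio_path, reference), ...]
    ("bench1", "de-en", "clean"): [("utt1.wav", "the cat sat on the mat")],
    ("bench1", "de-en", "noisy"): [("utt2.wav", "the cat sat on the mat")],
}

for name, translate in systems.items():
    for cell, pairs in grid.items():
        print(f"{name} {cell}: chrF = {score_system(translate, pairs):.1f}")
```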
Why it matters?
The findings show that the traditional two-step cascades are still the most reliable option overall. SpeechLLMs match them only in selected settings, and speech foundation models on their own lag behind both. This highlights that a powerful language model, whether built into the SpeechLLM itself or used as the second stage of a pipeline, is essential for high-quality speech translation.
Abstract
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.