EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
2025-09-12
Summary
This paper introduces EchoX, a new approach to building speech-to-speech large language models that aims to preserve the understanding and reasoning abilities of their text-based counterparts.
What's the problem?
Current speech-to-speech models are good at producing spoken responses, but they often lose some of the knowledge and reasoning skills of the text-based models they are built on. This happens because of a disconnect between how the model represents the sounds of speech and the meaning behind those sounds – it struggles to connect what it *hears* with what it *knows*. Essentially, the model does not fully grasp the concepts being discussed when it is given only audio input.
What's the solution?
The researchers developed EchoX, which bridges this gap by focusing on the meaning of the speech. Instead of training the model only to reproduce sounds, EchoX dynamically generates its own speech training targets that emphasize semantic content – the actual ideas being expressed. By combining learning from the acoustic features of speech with learning from semantic representations, the model maintains strong reasoning abilities even when working entirely with speech.
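The idea of combining acoustic and semantic learning can be pictured as a joint training objective. The toy sketch below is purely illustrative and is not the authors' implementation: the function names, the use of two cross-entropy terms, and the 0.5 weighting are all assumptions made for the example.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability distribution."""
    return -math.log(probs[target_index])

def joint_loss(acoustic_probs, acoustic_target,
               semantic_probs, semantic_target,
               semantic_weight=0.5):
    """Toy joint objective (illustrative only, not the paper's actual loss):
    an acoustic term over speech-token targets plus a weighted semantic term
    over meaning-level targets, so the model is penalized both for producing
    the wrong sounds and for drifting from the intended meaning."""
    l_acoustic = cross_entropy(acoustic_probs, acoustic_target)
    l_semantic = cross_entropy(semantic_probs, semantic_target)
    return l_acoustic + semantic_weight * l_semantic
```

In a real system each term would be computed over full model output distributions; the point of the sketch is only that neither the acoustic nor the semantic signal is trained in isolation.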
Why it matters?
This work is important because it improves the ability of speech-based AI to handle complex tasks that require understanding and reasoning, like answering questions based on spoken information. This could lead to more intelligent voice assistants and other speech-driven applications that aren't just good at recognizing words, but actually understand what's being said.
Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.