MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang

2025-10-07

Summary

This paper introduces MOSS-Speech, a new type of speech-to-speech system powered by a large language model that works directly with audio, bypassing the need to convert speech to text and back again.

What's the problem?

Traditional spoken dialogue systems handle speech in a cascade of steps: first converting it to text, then processing that text with a language model, and finally turning the response back into speech. Each conversion discards details carried in the original audio, like tone of voice and emotion, and the extra stages add delay. Newer systems try to avoid this by going straight from speech to speech, but most still rely on text as an intermediate step, which limits how expressive and responsive they can be. Essentially, always routing through text creates a bottleneck.
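To make that bottleneck concrete, here is a minimal sketch contrasting the two designs. The asr, text_llm, and tts functions are hypothetical stand-ins (not from the paper) that operate on plain Python data so the example runs as-is; the toy "tone" field stands in for the paralinguistic cues that the text step drops.

```python
# Toy contrast between a cascaded pipeline and direct speech-to-speech.
# All functions below are illustrative stubs, not the paper's code.

def asr(speech_tokens):
    """Cascaded step 1: transcribe speech to text (tone/emotion fields are dropped)."""
    return " ".join(tok["word"] for tok in speech_tokens)

def text_llm(text):
    """Cascaded step 2: process the transcript with a text-only LLM."""
    return f"Reply to: {text}"

def tts(text):
    """Cascaded step 3: resynthesize speech from text with default prosody."""
    return [{"word": w, "tone": "neutral"} for w in text.split()]

def cascaded_pipeline(speech_tokens):
    # Three sequential stages; expressive detail is lost at the text bottleneck.
    return tts(text_llm(asr(speech_tokens)))

def direct_speech_to_speech(speech_tokens):
    """What MOSS-Speech aims for: one model mapping speech tokens to speech tokens,
    so cues like tone can flow through. This stub simply echoes the input's tone."""
    return [{"word": "reply", "tone": tok["tone"]} for tok in speech_tokens]

if __name__ == "__main__":
    utterance = [{"word": "hello", "tone": "excited"}, {"word": "there", "tone": "excited"}]
    print(cascaded_pipeline(utterance))        # tone comes back as "neutral"
    print(direct_speech_to_speech(utterance))  # tone is preserved
```

In the cascaded path, everything except the words is thrown away at the transcription step, which is exactly the information loss described above.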

What's the solution?

The researchers created MOSS-Speech, which is designed to understand and generate speech directly, without ever needing text as a middle step. They did this by combining two techniques: a modality-based layer-splitting design that gives speech its own dedicated layers inside the language model, and a frozen pre-training strategy that keeps the weights of an existing, powerful text-based language model fixed instead of retraining them from scratch. This lets MOSS-Speech keep the reasoning and knowledge already built into the text model while adding the ability to work natively with audio.
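The PyTorch sketch below illustrates one plausible reading of those two ideas. The choice to share attention and split only the feed-forward layers, and the name-based freezing rule, are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch, not the paper's architecture:
# (1) a modality-split block routes text vs. speech hidden states through
#     separate parameters, and
# (2) the pretrained text-side weights are frozen so only the new
#     speech-side parameters are trained.

class ModalitySplitBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # shared across modalities (assumption)
        self.text_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.speech_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, modality):
        h, _ = self.attn(x, x, x)
        x = x + h
        ffn = self.text_ffn if modality == "text" else self.speech_ffn
        return x + ffn(x)

def freeze_text_side(model):
    """Freeze everything except the speech-specific parameters,
    preserving the pretrained text LLM's knowledge."""
    for name, p in model.named_parameters():
        p.requires_grad = "speech" in name

model = nn.ModuleList([ModalitySplitBlock(dim=64) for _ in range(2)])
freeze_text_side(model)

x = torch.randn(1, 10, 64)            # a batch of 10 toy speech-token embeddings
for block in model:
    x = block(x, modality="speech")   # gradients would flow only into speech_ffn weights
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```

Freezing the pretrained text-side parameters is what preserves the original model's reasoning and knowledge, while the newly added speech-specific parameters are the only ones updated during speech training.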

Why it matters?

This work is important because it moves us closer to creating more natural and efficient speech-based interactions. By removing the need for text translation, MOSS-Speech can preserve more of the nuances of speech and potentially respond faster. It shows a new way to build speech systems that are more expressive and can better understand and react to how we actually speak, bridging the performance gap between systems that use text and those that work directly with speech.

Abstract

Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.