LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
2025-03-07
Summary
This paper introduces LLMVoX, a new system that lets AI language models speak out loud in a natural way without changing how the underlying model thinks or writes.
What's the problem?
Current methods for making AI language models speak often degrade the model's ability to think and communicate well. They also demand a lot of computing power and don't always align the generated speech with the text correctly.
What's the solution?
The researchers created LLMVoX, a small add-on system that can work with any AI language model. It turns the model's text into speech quickly and accurately without modifying the model itself. LLMVoX can handle long conversations and even works in other languages with only a small adjustment.
Why does it matter?
This matters because it makes AI assistants more versatile and natural to interact with. People can now have long, flowing conversations with AI that sound more human, without losing any of the model's intelligence. It also opens up possibilities for AI that can understand text, images, and speech all at once, making it more useful in real-world situations.
Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
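The decoupling of speech synthesis from LLM processing via a multi-queue token streaming system can be illustrated with a simple producer-consumer pipeline. This is a minimal sketch, not the paper's implementation: the function names (`llm_producer`, `tts_consumer`, `synthesize_chunk`) and the fixed chunk size are hypothetical stand-ins for the actual chunking and synthesis logic.

```python
import queue
import threading

SENTINEL = None  # signals end of stream on a queue


def synthesize_chunk(tokens):
    # Placeholder for real speech synthesis; returns a fake "waveform" label.
    return "audio(" + " ".join(tokens) + ")"


def llm_producer(tokens, text_queue):
    """Stands in for the LLM: streams text tokens as they are generated."""
    for tok in tokens:
        text_queue.put(tok)
    text_queue.put(SENTINEL)


def tts_consumer(text_queue, audio_queue, chunk_size=3):
    """Stands in for the streaming TTS: groups tokens into chunks and
    synthesizes each chunk independently of the LLM's generation pace."""
    buffer = []
    while True:
        tok = text_queue.get()
        if tok is SENTINEL:
            break
        buffer.append(tok)
        if len(buffer) >= chunk_size:
            audio_queue.put(synthesize_chunk(buffer))
            buffer = []
    if buffer:  # flush any trailing partial chunk
        audio_queue.put(synthesize_chunk(buffer))
    audio_queue.put(SENTINEL)


def run_pipeline(tokens):
    """Run producer and consumer concurrently; collect audio as it arrives."""
    text_q, audio_q = queue.Queue(), queue.Queue()
    threading.Thread(target=llm_producer, args=(tokens, text_q)).start()
    threading.Thread(target=tts_consumer, args=(text_q, audio_q)).start()
    clips = []
    while (clip := audio_q.get()) is not SENTINEL:
        clips.append(clip)  # in a real system, play each clip immediately
    return clips


print(run_pipeline(["Hello", "world", "from", "LLMVoX"]))
```

Because text and audio travel on separate queues, the LLM never waits for synthesis and the dialogue can in principle run indefinitely, which is the property the multi-queue design is meant to provide.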