Voxtral TTS
Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Samuel Humeau, Soham Ghosh, Srijan Mishra, Van Phung, Abdelaziz Bounhar
2026-03-28
Summary
This paper introduces Voxtral TTS, a multilingual text-to-speech model that can generate natural, expressive speech in a new voice from as little as three seconds of reference audio.
What's the problem?
Existing text-to-speech systems often require a lot of audio data to accurately mimic someone's voice, and sometimes the resulting speech doesn't sound very natural or expressive, especially when trying to clone voices across different languages. It's hard to get a system that's both quick to adapt to a new voice *and* produces high-quality, natural-sounding speech.
What's the solution?
The researchers built Voxtral TTS as a hybrid of two generation techniques on top of a custom speech tokenizer. The tokenizer, called Voxtral Codec, was trained from scratch and converts audio into discrete tokens (and back again) using a hybrid VQ-FSQ quantization scheme. The model then generates speech in two parts: an auto-regressive component produces the semantic tokens, which carry the overall meaning and flow of the speech, and a flow-matching component produces the acoustic tokens, which carry the specific sounds. With this design, the system can clone a voice from only a few seconds of reference audio.
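To make the flow-matching idea more concrete, here is a minimal sketch of how such a sampler turns random noise into an output by following a velocity field in small steps. Everything in it is an illustrative assumption: the function names are hypothetical, and the toy velocity field simply points at a known target, whereas the real model would use a trained neural network conditioned on the text, the semantic tokens, and the reference voice.

```python
import numpy as np

def euler_flow_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x

def make_toy_velocity(target):
    """Toy stand-in for a learned network (assumption, not the paper's model).

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries a point x at time t to the target is
    (target - x) / (1 - t).
    """
    target = np.asarray(target, dtype=np.float64)

    def velocity(x, t):
        return (target - x) / (1.0 - t)

    return velocity

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)               # starting point: pure noise
target = np.array([0.5, -1.0, 2.0, 0.0])     # hypothetical "acoustic" vector
sample = euler_flow_sample(make_toy_velocity(target), noise)
```

In this toy setting the sampler lands on the target after the final step. A trained model has no access to the target and must predict the velocity from its conditioning inputs instead.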
Why does it matter?
This work is important because it makes voice cloning much more accessible. Because Voxtral TTS needs only a few seconds of audio, it opens up personalized speech experiences for people who cannot provide long, carefully produced recordings. In human evaluations by native speakers, it was also preferred over ElevenLabs Flash v2.5 for multilingual voice cloning, with a 68.4% win rate attributed to its naturalness and expressivity. The model weights are released under a CC BY-NC license.
Abstract
We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.
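For readers unfamiliar with the two quantizers named in the abstract, the sketch below shows generic vector quantization (VQ, nearest-codebook lookup) and finite scalar quantization (FSQ, per-dimension rounding of a bounded value). The abstract does not say how Voxtral Codec combines them, so this is not the codec's implementation: the function names, codebook, and level counts are assumptions chosen for illustration, and the FSQ variant shown handles only odd level counts.

```python
import numpy as np

def vq_quantize(z, codebook):
    """Vector quantization: index of the nearest codebook row per input vector."""
    # Squared Euclidean distance between every input vector and every code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)

def fsq_quantize(z, levels):
    """Finite scalar quantization for odd level counts.

    Each dimension is squashed with tanh, scaled to its level range,
    and rounded to the nearest integer in [-(L-1)/2, (L-1)/2].
    """
    half = (np.asarray(levels) - 1) / 2.0
    bounded = np.tanh(z) * half
    return np.round(bounded).astype(np.int64)

def fsq_index(codes, levels):
    """Pack per-dimension codes into a single token id (mixed-radix)."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2            # shift to [0, L-1]
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

# Hypothetical toy inputs.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
vq_idx = vq_quantize(np.array([[0.9, 0.8]]), codebook)   # nearest row is 1

levels = [5, 5, 5]                                # 5 * 5 * 5 = 125 possible tokens
codes = fsq_quantize(np.array([0.0, 10.0, -10.0]), levels)   # [0, 2, -2]
index = fsq_index(codes, levels)                  # 2*1 + 4*5 + 0*25 = 22
```

VQ learns its codebook, while FSQ has an implicit, fixed grid defined only by the level counts. A hybrid scheme presumably assigns each to a different part of the token stream, but that detail is not given here.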