Qwen3-TTS Technical Report
Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
2026-01-23
Summary
This paper introduces Qwen3-TTS, a new family of text-to-speech models designed to handle many languages, give you fine-grained control over how the speech sounds, stay reliable on difficult input, and run in real time.
What's the problem?
Creating realistic and controllable text-to-speech systems is hard, especially when you want them to work well across many languages and be fast enough for things like real-time conversations. Existing systems often fall short on quality, control, speed, or some combination of the three. It's also difficult to quickly create new voices or modify existing ones without a lot of effort.
What's the solution?
The researchers developed Qwen3-TTS, which pairs a dual-track language model with two different 'tokenizers' – think of them as ways to break sound down into manageable pieces. One tokenizer focuses on preserving the meaning of the speech, while the other prioritizes speed and low latency. They trained the models on a massive dataset of over 5 million hours of speech in 10 languages. This enables quick voice cloning (copying a voice from just 3 seconds of reference audio) and detailed control over the speech's characteristics, like tone and emotion. The system is designed for streaming, meaning it starts producing audio almost immediately rather than waiting for the whole sentence, as the toy example below illustrates.
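To make the streaming idea concrete, here is a minimal, self-contained Python sketch. It contains no Qwen code: the token generator and frame decoder are stand-ins for the real dual-track LM and causal decoder, and the sample rate is an assumption. The point is only the shape of the loop – audio is emitted chunk by chunk as tokens are generated.

```python
# Toy illustration of streaming TTS: audio comes out frame by frame as
# speech tokens are produced, instead of after the whole utterance.
# Everything below is a stand-in (no Qwen3-TTS code); in the real system
# a dual-track LM generates tokens and a block-wise DiT or a causal
# ConvNet decodes them.

from typing import Iterator
import numpy as np

FRAME_RATE_HZ = 12.5   # Qwen-TTS-Tokenizer-12Hz frame rate (from the report)
SAMPLE_RATE = 24_000   # assumed output sample rate
SAMPLES_PER_FRAME = int(SAMPLE_RATE / FRAME_RATE_HZ)  # 1920 samples = 80 ms

def generate_tokens(text: str) -> Iterator[int]:
    """Stand-in for the LM: one dummy speech token per character."""
    for ch in text:
        yield ord(ch)

def decode_frame(token: int) -> np.ndarray:
    """Stand-in for the causal decoder: one 80 ms audio chunk per token."""
    t = np.arange(SAMPLES_PER_FRAME) / SAMPLE_RATE
    return 0.1 * np.sin(2 * np.pi * (200 + token % 200) * t)

def synthesize_stream(text: str) -> Iterator[np.ndarray]:
    # The first chunk is emitted right after the first token, which is
    # what makes a ~97 ms first-packet latency possible in the real system.
    for token in generate_tokens(text):
        yield decode_frame(token)

audio = np.concatenate(list(synthesize_stream("hello")))
print(f"{audio.size / SAMPLE_RATE:.2f} s of audio")  # 0.40 s for 5 tokens
```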
Why it matters?
This work is important because it pushes the boundaries of what's possible with text-to-speech technology. The ability to create high-quality, controllable, and multilingual speech systems has a lot of potential applications, like more natural-sounding virtual assistants, improved accessibility tools for people with disabilities, and more engaging educational software. By releasing the models and tokenizers publicly, the researchers are also helping to accelerate further research and development in this field.
Abstract
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
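As a sanity check on the framing numbers in the abstract, the short Python sketch below works out what each tokenizer's frame rate implies: how much audio one frame covers, and how many tokens are emitted per second of speech. Only the frame rates and codebook counts come from the abstract; actual bitrate also depends on the codebook vocabulary sizes, which are not stated here, so it is left parameterized out.

```python
# Back-of-the-envelope framing comparison of the two tokenizers,
# using only the rates stated in the abstract.

def frame_stats(frame_rate_hz: float, num_codebooks: int) -> dict:
    """Audio span per frame and tokens emitted per second of audio."""
    return {
        "frame_duration_ms": 1000.0 / frame_rate_hz,
        "tokens_per_second": frame_rate_hz * num_codebooks,
    }

# Qwen-TTS-Tokenizer-25Hz: single codebook at 25 Hz.
print(frame_stats(25.0, 1))   # {'frame_duration_ms': 40.0, 'tokens_per_second': 25.0}

# Qwen-TTS-Tokenizer-12Hz: 16-layer multi-codebook at 12.5 Hz.
print(frame_stats(12.5, 16))  # {'frame_duration_ms': 80.0, 'tokens_per_second': 200.0}
```

Note that the 12.5 Hz tokenizer's longer 80 ms frames are what allow a first audio packet after roughly one frame of generation, consistent with the 97 ms first-packet figure above.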