Key Features

Long-form streaming TTS system for multi-speaker dialogue generation
Supports multiple languages including English, Chinese, Japanese, Korean, French, German, and Russian
Ultra-low latency with 12.5Hz streaming speech tokenizer
Dual-transformer architecture for flexible sentence-by-sentence generation
High similarity and low WER/CER in both monologue and dialogue tests
Random timbre generation for creating ASR/speech interaction data
Zero-shot voice cloning for cross-lingual and code-switching scenarios
Web UI tool for easy dialogue generation
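
The WER/CER figures mentioned in the list above are word and character error rates. As a rough illustration of what those metrics measure, here is a minimal sketch using the standard edit-distance definition. It is not the evaluation script used for FireRedTTS-2, and the function names are hypothetical. In TTS evaluation the hypothesis is typically an ASR transcript of the synthesized audio, compared against the input text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

WER is the usual choice for space-delimited languages such as English, while CER is commonly reported for languages such as Chinese, where word boundaries are not marked by spaces.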

The system achieves low latency by combining a new 12.5Hz streaming speech tokenizer with a dual-transformer architecture that operates on a text–speech interleaved sequence. This design enables flexible sentence-by-sentence generation and reduces first-packet latency: on an L20 GPU, first-packet latency is as low as 140ms while audio quality remains high. The system also achieves high similarity and low WER/CER in both monologue and dialogue tests.
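
To make the 12.5Hz figure concrete, the short sketch below works out how much audio one tokenizer frame covers and how many frames a clip needs. Only the 12.5Hz rate comes from the description above. The function names are hypothetical, and the number of tokens carried per frame (for example, multiple codebooks) is not stated here, so the code counts frames rather than tokens.

```python
import math

FRAME_RATE_HZ = 12.5  # tokenizer frame rate quoted in the description


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Milliseconds of audio covered by a single tokenizer frame."""
    return 1000.0 / frame_rate_hz


def frames_for_audio(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)


print(frame_duration_ms())     # 80.0 ms of audio per frame
print(frames_for_audio(60.0))  # 750 frames for one minute of speech
```

At this rate each frame represents 80ms of audio, so a minute of dialogue is only 750 frames long, which keeps sequences short for long-form generation.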


FireRedTTS-2 supports both voice cloning and randomized voices. Random timbre generation makes it useful for creating ASR and speech interaction data, while zero-shot voice cloning covers cross-lingual and code-switching scenarios. Possible applications include podcast generation, chatbot development, and language learning. A web UI tool is also provided for easy dialogue generation.
