The system achieves ultra-low latency by building on a new 12.5 Hz streaming speech tokenizer and a dual-transformer architecture that operates on a text–speech interleaved sequence. This design enables flexible sentence-by-sentence generation and reduces first-packet latency: on an L20 GPU, first-packet latency is as low as 140 ms while audio quality remains high. The system also achieves high speaker similarity and low WER/CER in both monologue and dialogue tests.
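To make the 12.5 Hz figure and the interleaved layout concrete, here is a minimal sketch in Python. The frame-rate arithmetic follows directly from the stated 12.5 Hz rate; everything else is an assumption for illustration only. The `interleave` helper, the `[S1]`/`[S2]` speaker tags, and the integer token ids are hypothetical and are not taken from the FireRedTTS-2 codebase, whose actual sequence format may differ (for example, a tokenizer may emit several codebook entries per frame).

```python
import math
from typing import List, Tuple, Union

FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the text


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio duration covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_audio(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)


# One sequence element: ("text", sentence) or ("speech", token_id).
Element = Tuple[str, Union[str, int]]


def interleave(turns: List[Tuple[str, List[int]]]) -> List[Element]:
    """Hypothetical interleaving: each sentence's text is followed by its speech tokens.

    Because every sentence is immediately followed by its own speech tokens,
    a model consuming this layout can begin emitting audio for the first
    sentence without waiting for later sentences.
    """
    sequence: List[Element] = []
    for text, speech_tokens in turns:
        sequence.append(("text", text))
        sequence.extend(("speech", token) for token in speech_tokens)
    return sequence


if __name__ == "__main__":
    print(frame_duration_ms())    # duration of one frame in ms
    print(frames_for_audio(2.0))  # frames needed for 2 s of audio
    # Placeholder token ids; a real tokenizer would produce these from audio.
    print(interleave([("[S1] Hi.", [1, 2]), ("[S2] Hello.", [3])]))
```

At 12.5 Hz each frame covers 80 ms of audio, so a low frame rate keeps the speech portion of the sequence short. How many frames make up the first audio packet is not stated in the text, so the sketch makes no claim about how the 140 ms figure decomposes.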
FireRedTTS-2 features random timbre generation, which is useful for creating ASR and speech-interaction data. Its applications include podcast generation, chatbot development, and language learning. It also supports zero-shot voice cloning in cross-lingual and code-switching scenarios. A web UI tool is provided for easy dialogue generation, with support for both cloned and randomized voices.