Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost
Venkata Pushpak Teja Menta
2026-04-30
Summary
This paper focuses on improving text-to-speech (TTS) quality for Indian languages such as Telugu, Tamil, and Hindi. The authors aim to get an existing TTS model with no native Indic support to perform as well as professional commercial systems, without extensive retraining and without using any commercial training data.
What's the problem?
Current open-source TTS systems for Indian languages lag behind the commercial options. Chatterbox, a popular multilingual model, cannot even tokenise the Telugu and Tamil scripts. Building high-quality TTS from scratch requires large amounts of data and training, which is a significant hurdle for these languages.
What's the solution?
The researchers took Chatterbox, a TTS model designed for other languages, and made three key changes. First, they used a system called BUPS to deterministically convert Indian-language scripts into a romanised form that Chatterbox's tokeniser can process. Second, they trained a small add-on module (a LoRA adapter) on only the model's text-token predictor, using about 1,220 hours of licensed Indian-language audio. Finally, they found a way to steer the model's voice output using short (8-11 second) same-language reference clips plus a few sampling adjustments, recovering commercial-class sound quality without retraining the core acoustic engine. For Hindi, the adapter actually hurt accuracy, so they instead use the original Chatterbox with just the voice-prompting technique.
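To make the first step concrete, here is a toy deterministic Telugu-to-ISO-15919 romaniser illustrating the idea behind BUPS. This is a minimal sketch, not the released code: the real system covers seven Brahmic scripts, and the character tables below are a small hypothetical excerpt.

```python
# Toy romaniser: Telugu script -> ISO-15919 Latin.
# Hypothetical excerpt of the mapping tables; the real BUPS covers seven scripts.
CONSONANTS = {"క": "k", "ఫ": "ph", "న": "n", "మ": "m", "ర": "r", "స": "s"}
INDEP_VOWELS = {"అ": "a", "ఆ": "ā", "ఇ": "i", "ఈ": "ī"}
VOWEL_SIGNS = {"ా": "ā", "ి": "i", "ీ": "ī", "ు": "u", "ూ": "ū"}
VIRAMA = "్"    # suppresses a consonant's inherent 'a' (forms clusters)
ANUSVARA = "ం"  # nasal sign, ISO-15919 'ṁ'

def romanise(text: str) -> str:
    out: list[str] = []
    pending_a = False  # a consonant was seen and may still carry its inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")          # previous consonant keeps its inherent vowel
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # dependent vowel replaces the inherent 'a'
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False            # consonant cluster: emit no vowel
        elif ch == ANUSVARA:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append("ṁ")
        elif ch in INDEP_VOWELS:
            out.append(INDEP_VOWELS[ch])
        else:
            out.append(ch)               # pass through anything unmapped
    if pending_a:
        out.append("a")                  # word-final inherent vowel
    return "".join(out)

print(romanise("నమస్కారం"))  # → namaskāraṁ
```

Because the mapping is deterministic and reversible at the phoneme level, the same Latin tokens feed straight into Chatterbox's existing tokeniser with no vocabulary changes.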
Why it matters?
This work is important because it shows you can significantly improve TTS quality for Indian languages with relatively little effort and data. By adapting an existing model instead of building one from scratch, they’ve lowered the barrier to entry for creating high-quality speech synthesis for these languages, and their methods are openly available for others to use and build upon.
Abstract
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.