Key Features

Context-Aware, Expressive Speech Generation
True-to-Life Voice Cloning
High-Efficiency Synthesis
Streaming Synthesis
Real-Time Factor (RTF) as low as 0.17
Fine-Grained Control over Speech Attributes
Support for Multiple Languages
End-to-End Diffusion Autoregressive Architecture

VoxCPM comprehends text to infer and generate appropriate prosody, delivering expressive, natural-sounding speech. Trained on a 1.8 million-hour bilingual corpus, it adapts its speaking style to the content rather than reading every sentence the same way. With only a short reference audio clip, VoxCPM performs zero-shot voice cloning, capturing not only the speaker’s timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing.


VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, which makes it suitable for real-time applications. The model can also be fine-tuned for specific use cases. On public zero-shot TTS benchmarks, VoxCPM achieves competitive results in speech quality and naturalness. It generates speech in multiple languages, including Chinese and English.
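
To make the RTF figure concrete: RTF is conventionally defined as synthesis time divided by the duration of the audio produced, so any value below 1.0 means speech is generated faster than it plays back. The sketch below is a minimal illustration of that arithmetic, not VoxCPM's own benchmarking code. The function names, the `synthesize` callable, and the 10-second clip are assumptions made for the example.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is produced faster than it plays."""
    return rtf < 1.0


def expected_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate the wall-clock time needed for a clip at a given RTF."""
    return audio_seconds * rtf


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical TTS callable
    text: str,
    sample_rate: int,
) -> float:
    """Time one call to `synthesize` and return the observed RTF.

    `synthesize` stands in for any function that returns raw audio
    samples; it is not part of a specific library's API.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # Illustrative only: a 10-second clip at the quoted RTF of 0.17.
    seconds = expected_synthesis_seconds(10.0, 0.17)
    print(f"Estimated synthesis time: {seconds:.2f} s")
    print(f"Real-time capable: {is_real_time(0.17)}")
```

At the quoted RTF of 0.17, a 10-second clip would take roughly 1.7 seconds to synthesize. Measured values depend on hardware, text length, and inference settings, so the published figure is best read as a lower bound observed on an RTX 4090.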
