VoxCPM

Free Audio Speech Synthesis


Key Features

Context-Aware, Expressive Speech Generation

True-to-Life Voice Cloning

High-Efficiency Synthesis

Streaming Synthesis

Real-Time Factor (RTF) as low as 0.17

Fine-Grained Control over Speech Attributes

Support for Multiple Languages

End-to-End Diffusion Autoregressive Architecture

VoxCPM comprehends text to infer and generate appropriate prosody, delivering speech with remarkable expressiveness and natural flow. Trained on a massive 1.8 million-hour bilingual corpus, it spontaneously adapts its speaking style to the content, producing highly fitting vocal expression. With only a short reference audio clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker’s timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing to create a faithful and natural replica.

VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, which makes real-time applications feasible. The model is efficient and can be fine-tuned for specific use cases. On public zero-shot TTS benchmarks, VoxCPM achieves competitive results in speech quality and naturalness. It can also generate speech in multiple languages, including Chinese and English.
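
To make the RTF figure concrete: Real-Time Factor is conventionally defined as the time spent synthesizing divided by the duration of the audio produced, so any value below 1.0 means speech is generated faster than it plays back. The short Python sketch below only illustrates that arithmetic. The function names are hypothetical and are not part of VoxCPM's API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long it takes to synthesize a clip of the given length."""
    return audio_seconds * rtf


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is produced faster than it plays back."""
    return rtf < 1.0


if __name__ == "__main__":
    rtf = 0.17  # the best-case figure quoted above for an RTX 4090
    clip = 10.0  # seconds of speech to generate (illustrative value)
    print(f"Estimated synthesis time: {estimated_synthesis_seconds(clip, rtf):.2f} s")
    print(f"Faster than real time: {is_faster_than_real_time(rtf)}")
```

At the quoted best-case RTF of 0.17, a 10-second clip would take roughly 1.7 seconds to synthesize. Actual figures depend on hardware, text length, and inference settings.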
