VibeVoice-Realtime-0.5B


Key Features

Lightweight real-time text-to-speech model
Streaming text input support
Robust long-form speech generation
Single speaker support
Efficient acoustic tokenizer
Ultra-low frame rate operation
Multilingual capability
Real-time TTS services support

The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while continuing diffusion-based acoustic latent generation from prior context. It relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate, achieving 3200x downsampling from 24 kHz input. Although built primarily for English, the model shows some multilingual capability and performs reasonably well in a few other languages.
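The downsampling figure above implies a concrete frame rate, which the following minimal sketch works out. It is plain arithmetic on the two numbers quoted in the text (24 kHz input, 3200x downsampling); the function names are hypothetical and are not part of any VibeVoice API.

```python
import math

# Figures quoted in the description above.
SAMPLE_RATE_HZ = 24_000       # input audio sample rate
DOWNSAMPLING_FACTOR = 3_200   # audio samples represented by one acoustic frame


def frame_rate_hz(sample_rate_hz: int = SAMPLE_RATE_HZ,
                  factor: int = DOWNSAMPLING_FACTOR) -> float:
    """Acoustic frames per second implied by the downsampling factor."""
    return sample_rate_hz / factor


def frames_for_duration(seconds: float,
                        sample_rate_hz: int = SAMPLE_RATE_HZ,
                        factor: int = DOWNSAMPLING_FACTOR) -> int:
    """Number of frames needed to cover `seconds` of audio, rounded up.

    Hypothetical helper for illustration only; how the real tokenizer
    pads or rounds partial frames is not stated in the source.
    """
    return math.ceil(seconds * sample_rate_hz / factor)


if __name__ == "__main__":
    print(frame_rate_hz())            # 7.5
    print(frames_for_duration(60))    # 450
```

At 7.5 frames per second, a minute of speech corresponds to only 450 acoustic frames, which is what makes long-form generation tractable for a model of this size.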


VibeVoice-Realtime has been evaluated on benchmarks including the LibriSpeech test-clean set and the SEED test-en set. It performs satisfactorily on these short-sentence benchmarks, although its design focus is long-form speech generation. The model is intended for research and development purposes only. Users are encouraged to disclose the use of AI-generated content and to use the model responsibly, in compliance with applicable laws and regulations.
