The model uses an interleaved, windowed design that incrementally encodes incoming text chunks while continuing diffusion-based generation of acoustic latents from prior context. It relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate of 7.5 Hz, corresponding to 3200x downsampling of the 24 kHz input. Although primarily built for English, the model exhibits a degree of multilingual capability and performs reasonably well in some other languages.
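To make the interleaved control flow concrete, here is a minimal sketch of such a streaming loop. It is not the actual VibeVoice-Realtime API: `encode_text_chunk`, `diffuse_latents`, and `decode_latents` are hypothetical stand-ins for the real text encoder, diffusion generator, and acoustic decoder, and `LATENT_DIM`, `window_frames`, and the frames-per-word pacing are illustrative assumptions. Only the 24 kHz sample rate and 3200x downsampling factor come from the description above.

```python
# Sketch of an interleaved, windowed streaming TTS loop: text chunks are
# encoded incrementally while acoustic latents are generated from a sliding
# window of prior context. All model components below are placeholders.
import numpy as np

SAMPLE_RATE = 24_000                    # from the model card
DOWNSAMPLE = 3_200                      # from the model card
FRAME_RATE = SAMPLE_RATE / DOWNSAMPLE   # => 7.5 latent frames per second
LATENT_DIM = 64                         # hypothetical acoustic-latent width

def encode_text_chunk(chunk: str) -> np.ndarray:
    """Stand-in text encoder: maps a text chunk to embedding vectors."""
    rng = np.random.default_rng(abs(hash(chunk)) % (2**32))
    return rng.standard_normal((len(chunk.split()), LATENT_DIM))

def diffuse_latents(text_emb: np.ndarray, context: np.ndarray,
                    n_frames: int) -> np.ndarray:
    """Stand-in diffusion step: produces acoustic latents conditioned on the
    encoded text and on previously generated latents (the rolling context)."""
    cond = (context.mean(axis=0, keepdims=True) if len(context)
            else np.zeros((1, LATENT_DIM)))
    return np.tile(cond, (n_frames, 1)) + text_emb.mean(axis=0)

def decode_latents(latents: np.ndarray) -> np.ndarray:
    """Stand-in acoustic decoder: each latent frame covers DOWNSAMPLE samples."""
    return np.repeat(latents[:, 0], DOWNSAMPLE)

def stream_tts(text_chunks, window_frames: int = 16):
    """Interleave incremental text encoding with windowed latent generation."""
    context = np.empty((0, LATENT_DIM))
    for chunk in text_chunks:
        text_emb = encode_text_chunk(chunk)  # incremental text encoding
        # Illustrative pacing: FRAME_RATE frames per word, i.e. one second
        # of audio per word at 7.5 frames/sec.
        n_frames = max(1, round(FRAME_RATE * len(chunk.split())))
        latents = diffuse_latents(text_emb, context, n_frames)
        # Keep only a sliding window of latents as context for the next chunk.
        context = np.concatenate([context, latents])[-window_frames:]
        yield decode_latents(latents)        # emit audio as soon as it is ready

audio = np.concatenate(list(stream_tts(["Hello there,", "this is streaming speech."])))
print(f"{audio.size / SAMPLE_RATE:.2f} s of audio at {FRAME_RATE} frames/sec")
```

The key design point the sketch illustrates is that text encoding and latent generation alternate per chunk, with only a bounded window of prior latents carried forward, so audio can be emitted before the full input text is available.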
VibeVoice-Realtime has been evaluated on several benchmarks, including the LibriSpeech test-clean set and the SEED test-en set; it performs satisfactorily on these short-sentence benchmarks even though its design targets long-form speech generation. The model is intended for research and development purposes only. Users are advised to disclose the use of AI-generated content and to use the model responsibly, in compliance with applicable laws and regulations.

