The model uses an interleaved, windowed design that incrementally encodes incoming text chunks while continuing diffusion-based generation of acoustic latents from prior context. It relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate of 7.5 Hz, corresponding to 3200x downsampling of the 24 kHz input. Although primarily built for English, the model exhibits a degree of multilingual capability and performs reasonably well in some other languages.
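To make the interleaved control flow concrete, here is a minimal sketch of such a streaming loop. It is not the actual VibeVoice-Realtime API: `encode_text_chunk`, `diffuse_latents`, and `decode_latents` are hypothetical stand-ins for the real text encoder, diffusion generator, and acoustic decoder, and `LATENT_DIM`, `window_frames`, and the frames-per-word pacing are illustrative assumptions. Only the 24 kHz sample rate and 3200x downsampling factor come from the description above.

```python
# Sketch of an interleaved, windowed streaming TTS loop: text chunks are
# encoded incrementally while acoustic latents are generated from a sliding
# window of prior context. All model components below are placeholders.
import numpy as np

SAMPLE_RATE = 24_000                    # from the model card
DOWNSAMPLE = 3_200                      # from the model card
FRAME_RATE = SAMPLE_RATE / DOWNSAMPLE   # => 7.5 latent frames per second
LATENT_DIM = 64                         # hypothetical acoustic-latent width

def encode_text_chunk(chunk: str) -> np.ndarray:
    """Stand-in text encoder: maps a text chunk to embedding vectors."""
    rng = np.random.default_rng(abs(hash(chunk)) % (2**32))
    return rng.standard_normal((len(chunk.split()), LATENT_DIM))

def diffuse_latents(text_emb: np.ndarray, context: np.ndarray,
                    n_frames: int) -> np.ndarray:
    """Stand-in diffusion step: produces acoustic latents conditioned on the
    encoded text and on previously generated latents (the rolling context)."""
    cond = (context.mean(axis=0, keepdims=True) if len(context)
            else np.zeros((1, LATENT_DIM)))
    return np.tile(cond, (n_frames, 1)) + text_emb.mean(axis=0)

def decode_latents(latents: np.ndarray) -> np.ndarray:
    """Stand-in acoustic decoder: each latent frame covers DOWNSAMPLE samples."""
    return np.repeat(latents[:, 0], DOWNSAMPLE)

def stream_tts(text_chunks, window_frames: int = 16):
    """Interleave incremental text encoding with windowed latent generation."""
    context = np.empty((0, LATENT_DIM))
    for chunk in text_chunks:
        text_emb = encode_text_chunk(chunk)  # incremental text encoding
        # Illustrative pacing: FRAME_RATE frames per word, i.e. one second
        # of audio per word at 7.5 frames/sec.
        n_frames = max(1, round(FRAME_RATE * len(chunk.split())))
        latents = diffuse_latents(text_emb, context, n_frames)
        # Keep only a sliding window of latents as context for the next chunk.
        context = np.concatenate([context, latents])[-window_frames:]
        yield decode_latents(latents)        # emit audio as soon as it is ready

audio = np.concatenate(list(stream_tts(["Hello there,", "this is streaming speech."])))
print(f"{audio.size / SAMPLE_RATE:.2f} s of audio at {FRAME_RATE} frames/sec")
```

The key design point the sketch illustrates is that text encoding and latent generation alternate per chunk, with only a bounded window of prior latents carried forward, so audio can be emitted before the full input text is available.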
VibeVoice-Realtime has been evaluated on several benchmarks, including the LibriSpeech test-clean set and the SEED test-en set; it performs satisfactorily on these short-sentence benchmarks even though its design targets long-form speech generation. The model is intended for research and development purposes only. Users are advised to disclose the use of AI-generated content and to use the model responsibly, in compliance with applicable laws and regulations.

