VibeVoice Technical Report
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
2025-08-27
Summary
This paper introduces VibeVoice, a new computer model that can create realistic, long recordings of speech with multiple people talking, aiming to sound like a natural conversation.
What's the problem?
Creating long, multi-speaker audio recordings with computers is difficult because it demands substantial computing power and an efficient way to represent all the audio information. Existing methods struggle with long conversations: they either fail to sound natural or become computationally expensive.
What's the solution?
The researchers developed VibeVoice, which generates speech with a technique called 'next-token diffusion': continuous audio representations are produced one at a time, each refined from noise by a diffusion process conditioned on what came before. They also created a new way to compress audio data that achieves 80 times higher compression than a popular method called Encodec while preserving audio quality. This allows VibeVoice to create up to 90 minutes of speech with up to four different speakers.
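To make the idea concrete, here is a toy sketch of the next-token diffusion loop. This is a hypothetical simplification, not the authors' implementation: the `toy_denoiser` function stands in for a learned denoising network, and the latent dimension and step count are arbitrary.

```python
import random

LATENT_DIM = 4        # size of each continuous latent vector (arbitrary here)
DIFFUSION_STEPS = 10  # denoising iterations per token (arbitrary here)

def toy_denoiser(noisy, context, t):
    """Placeholder for a learned denoising network: nudges the noisy
    latent toward the mean of the previously generated latents
    (toward zero when there is no context yet)."""
    if context:
        target = [sum(vals) / len(context) for vals in zip(*context)]
    else:
        target = [0.0] * LATENT_DIM
    # denoise more aggressively at later steps, never fully collapsing
    alpha = (t + 1) / (DIFFUSION_STEPS + 1)
    return [n + alpha * (g - n) for n, g in zip(noisy, target)]

def generate_latents(num_tokens, seed=0):
    """Autoregressively generate `num_tokens` continuous latent vectors:
    each token starts as pure noise and is iteratively denoised,
    conditioned on the sequence produced so far."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(num_tokens):
        x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        for t in range(DIFFUSION_STEPS):
            x = toy_denoiser(x, sequence, t)
        sequence.append(x)
    return sequence

latents = generate_latents(5)
```

The key design point the sketch illustrates is that the outer loop is ordinary autoregression, while the inner loop is a diffusion-style iterative refinement that outputs a continuous vector rather than a discrete token.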
Why it matters?
VibeVoice is important because it can generate much longer and more realistic conversations than previous models. This has potential applications in things like creating audiobooks, realistic characters in video games, or improving voice assistants, and it does so more efficiently than existing technologies.
Abstract
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
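The stated context budget can be sanity-checked with quick arithmetic. Note this is a simplification: the 64K window must also hold text and script tokens, so the figure below is only an upper bound on the per-second audio token rate.

```python
# Back-of-the-envelope check of the budget stated in the abstract:
# up to 90 minutes of speech within a 64K-token context window.
context_tokens = 64 * 1024   # 64K context window
audio_seconds = 90 * 60      # 90 minutes of audio
tokens_per_second = context_tokens / audio_seconds
print(round(tokens_per_second, 1))  # prints 12.1
```

Roughly 12 tokens of context budget per second of audio shows why an aggressive tokenizer compression ratio is essential for long-form synthesis.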