Key Features

Unified speech-to-text model
Handles 60-minute long-form audio in a single pass
Supports customized hotwords and over 50 languages
Jointly performs ASR, diarization, and timestamping
Produces structured output with Who, When, and What
Requires no explicit language setting
Natively handles code-switching within and across utterances
Model size of 9B params with BF16 tensor type
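
The "Who, When, and What" output in the list above can be pictured as a list of speaker-attributed, timestamped segments. The sketch below is only an illustration of that shape: the `Segment` class, its field names, and the `render` helper are hypothetical and are not taken from the model's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment. Hypothetical field names, not the model's schema."""
    speaker: str   # Who: speaker label produced by diarization
    start: float   # When: segment start, in seconds
    end: float     # When: segment end, in seconds
    text: str      # What: the recognized words


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render(segments: list[Segment]) -> str:
    """Print segments in chronological order, one per line."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"{format_timestamp(seg.start)} - {format_timestamp(seg.end)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data, only to show the shape of the structure.
    demo = [
        Segment("Speaker 2", 4.2, 6.0, "Thanks, glad to be here."),
        Segment("Speaker 1", 0.0, 4.2, "Welcome to the show."),
    ]
    print(render(demo))
```

Keeping speaker, time span, and text together in one record makes it straightforward to sort, filter by speaker, or export to subtitle formats later.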

VibeVoice-ASR accepts up to 60 minutes of continuous audio in a single pass, within a 64K-token context length, which lets it keep speaker tracking and semantic coherence consistent across the entire hour. It also supports customized hotwords, which can significantly improve accuracy on domain-specific content. The model requires no explicit language setting and natively handles code-switching within and across utterances, which makes it well suited to multilingual applications.
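
As a rough worked example of what "60 minutes within 64K tokens" implies, the sketch below computes the average token budget per second of audio and the number of passes a longer recording would need. It assumes "64K" means 65,536 tokens, which the description does not state, and all function names are illustrative rather than part of any VibeVoice-ASR API.

```python
import math

# Assumption: "64K" is read as 64 * 1024 tokens; the source does not say.
CONTEXT_TOKENS = 64 * 1024
MAX_AUDIO_SECONDS = 60 * 60  # the stated 60-minute single-pass limit


def max_tokens_per_second(context_tokens: int = CONTEXT_TOKENS,
                          audio_seconds: int = MAX_AUDIO_SECONDS) -> float:
    """Upper bound on the average token budget per second of audio.

    The budget covers everything in the context, so the audio
    representation and the generated transcript share it.
    """
    return context_tokens / audio_seconds


def passes_needed(duration_seconds: float,
                  max_seconds: int = MAX_AUDIO_SECONDS) -> int:
    """Number of single-pass windows needed to cover a recording."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return math.ceil(duration_seconds / max_seconds)


if __name__ == "__main__":
    print(f"{max_tokens_per_second():.1f} tokens per second of audio at most")
    print(passes_needed(45 * 60), "pass for a 45-minute recording")
    print(passes_needed(90 * 60), "passes for a 90-minute recording")
```

Under these assumptions the budget works out to about 18 tokens per second of audio. A recording longer than an hour would have to be split, and speaker labels would then need to be reconciled across the pieces.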


VibeVoice-ASR has 9B parameters and uses the BF16 tensor type. It has been downloaded over 111,610 times in the last month and is released under the MIT License. The model is not currently deployed by any inference provider. It can be used for speech-to-text, diarization, and timestamping, which makes it useful to researchers and developers working on speech recognition and natural language processing.
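
The parameter count and tensor type give a quick estimate of how much memory the weights alone occupy, since BF16 stores each value in 2 bytes. The sketch below treats "9B" as exactly 9,000,000,000 parameters, which is an approximation. The function names are illustrative.

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit floating-point format


def weight_bytes(num_params: int, bytes_per_param: int = BYTES_PER_BF16) -> int:
    """Memory taken by the raw weights, ignoring activations and caches."""
    return num_params * bytes_per_param


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    params = 9_000_000_000  # assumption: "9B" rounded to exactly 9e9
    total = weight_bytes(params)
    print(f"{total / 1e9:.1f} GB ({to_gib(total):.2f} GiB) for weights alone")
```

This comes to roughly 18 GB for the weights. Actual memory use during inference will be higher, because activations and the attention cache for long audio come on top of that.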
