VibeVoice-ASR

Free Speech Recognition Software

Key Features

Unified speech-to-text model

Handles 60-minute long-form audio in a single pass

Supports customized hotwords and over 50 languages

Jointly performs ASR, diarization, and timestamping

Produces structured output with Who, When, and What

Requires no explicit language setting

Natively handles code-switching within and across utterances

Model size of 9B params with BF16 tensor type
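
The "Who, When, and What" output listed above can be pictured as a list of speaker-attributed, timestamped segments. The following is a minimal sketch of such a record, not the model's actual output schema: the `Segment` class, its field names, and the `render` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcript segment (hypothetical schema, for illustration only)."""
    speaker: str   # Who: speaker label from diarization
    start: float   # When: segment start, in seconds
    end: float     # When: segment end, in seconds
    text: str      # What: recognized text


def format_timestamp(seconds: float) -> str:
    """Format a time offset as MM:SS, truncating fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(segments: List[Segment]) -> str:
    """Render segments in chronological order, one per line."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 65.0, 70.5, "Hi"),
        Segment("Speaker 1", 0.0, 4.2, "Hello"),
    ]
    print(render(demo))
```

Keeping speaker, time span, and text in one record means downstream code can filter by speaker or time range without re-aligning separate ASR and diarization outputs.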

VibeVoice-ASR accepts up to 60 minutes of continuous audio in a single pass, within a 64K-token context length, so speaker tracking and semantic coherence stay consistent across the entire hour. It also supports customized hotwords, which can significantly improve accuracy on domain-specific content. The model requires no explicit language setting and natively handles code-switching within and across utterances, which suits it to multilingual applications.
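
To give a sense of scale for the 60-minute and 64K-token figures, the arithmetic below estimates how many tokens per second of audio the context can afford. It is a rough sketch that assumes "64K" means 64 × 1024 tokens and that the whole context is available for audio; if the same window also has to hold hotwords or the transcript, the share left for audio is smaller. The function name is illustrative.

```python
def token_budget_per_second(context_tokens: int, audio_seconds: float) -> float:
    """Average number of context tokens available per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return context_tokens / audio_seconds


CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" read as 65,536 tokens
AUDIO_SECONDS = 60 * 60      # 60 minutes of audio

budget = token_budget_per_second(CONTEXT_TOKENS, AUDIO_SECONDS)
print(f"{budget:.1f} tokens per second of audio")  # 18.2 tokens per second of audio
```

Reading "64K" as 64,000 instead gives about 17.8, so either way the budget is roughly 18 tokens per second at most.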

VibeVoice-ASR has 9B parameters stored in the BF16 tensor type. It was downloaded 111,610 times in the last month and is licensed under the MIT License. The model is not deployed by any inference provider, but it can be used for speech-to-text, diarization, and timestamping. It is aimed at researchers and developers working on speech recognition and natural language processing tasks.
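
The parameter count and tensor type together give a lower bound on the memory needed just to hold the weights. BF16 uses 2 bytes per value, so the sketch below multiplies that by the parameter count. It assumes "9B" means exactly 9,000,000,000 parameters (the listing gives a rounded figure) and ignores activations, caches, and framework overhead, which add to the total at inference time.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 is a 16-bit format


def weight_memory_bytes(num_params: int,
                        bytes_per_param: int = BYTES_PER_PARAM_BF16) -> int:
    """Bytes needed to store the model weights alone."""
    return num_params * bytes_per_param


NUM_PARAMS = 9_000_000_000  # assumption: "9B" taken as exactly 9e9

total = weight_memory_bytes(NUM_PARAMS)
print(f"{total / 1e9:.1f} GB")     # 18.0 GB
print(f"{total / 2**30:.1f} GiB")  # 16.8 GiB
```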
