VibeVoice-ASR accepts up to 60 minutes of continuous audio input within a 64K-token context, which keeps speaker tracking and semantic coherence consistent across the entire hour. It also supports custom hotwords, which can significantly improve accuracy on domain-specific content. The model requires no explicit language setting and natively handles code-switching within and across utterances, which makes it well suited to multilingual applications.
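As a rough illustration of what fitting an hour of audio into a 64K-token context implies, the sketch below computes the average token budget per second of audio. It assumes that "64K" means 64 × 1024 tokens and that the whole context is available for audio; neither is stated here, and in practice part of the context is presumably taken by the transcript itself. The function name is hypothetical.

```python
# Back-of-the-envelope token budget for one hour of audio.
# Assumption: "64K" is read as 64 * 1024 tokens, all spent on audio.
CONTEXT_TOKENS = 64 * 1024      # 65,536 tokens
MAX_AUDIO_SECONDS = 60 * 60     # 60 minutes


def max_tokens_per_second(context_tokens: int, audio_seconds: int) -> float:
    """Upper bound on the average number of tokens per second of audio."""
    return context_tokens / audio_seconds


budget = max_tokens_per_second(CONTEXT_TOKENS, MAX_AUDIO_SECONDS)
print(f"At most {budget:.1f} tokens per second of audio")
```

Under these assumptions the budget comes to roughly 18 tokens per second, which is an upper bound rather than the model's actual frame rate.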
VibeVoice-ASR has 9B parameters stored in the BF16 tensor type. It was downloaded more than 111,000 times in the last month and is released under the MIT License. No inference provider currently deploys the model, but it can be used for speech-to-text, diarization, and timestamping. That makes it useful to researchers and developers working on speech recognition and natural language processing.
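For anyone planning to run the model locally, the parameter count and tensor type give a quick estimate of how much memory the weights alone need. The sketch below is plain arithmetic, not an official requirement: it assumes exactly 9 billion parameters at 2 bytes each (BF16 is a 16-bit format) and ignores activations, caches, and framework overhead. The function name is hypothetical.

```python
# Estimate the memory needed to hold the model weights.
# Assumptions: exactly 9e9 parameters, 2 bytes per BF16 value,
# and nothing counted beyond the weights themselves.
PARAMS = 9_000_000_000
BYTES_PER_BF16 = 2


def weight_bytes(params: int, bytes_per_param: int) -> int:
    """Total bytes needed to store the weights."""
    return params * bytes_per_param


total = weight_bytes(PARAMS, BYTES_PER_BF16)
weights_gb = total / 1e9       # decimal gigabytes
weights_gib = total / 2**30    # binary gibibytes
print(f"Weights: {weights_gb:.1f} GB ({weights_gib:.1f} GiB)")
```

The weights come to about 18 GB, so actual memory use during inference will be higher once long-audio activations are included.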


