Higgs Audio V2 represents a significant leap forward in audio AI capabilities. It supports multi-speaker conversations, long-form audio generation, and high-fidelity output. The model is trained on a massive self-annotated corpus of over 10M hours of audio, annotated using BosonAI's ASR and LLM models. Higgs Audio V2 adopts an innovative Dual-FFN architecture capable of handling text and audio tokens jointly. Moreover, its tokenizer has dedicated representations for both the semantic and acoustic aspects of audio.
Higgs Audio V2 is now open source, making it the first open-source, large-scale audio model that excels at multi-speaker, lifelike, and emotionally expressive voice generation. It opens doors for developers, creatives, and researchers to build conversational agents, audiobooks, podcasts, and more with lifelike performance. Higgs Audio V2 has achieved state-of-the-art performance, beating gpt-4o-mini-tts with a 75.7% win rate on Emotions and 55.7% on Questions in EmergentTTS-Eval. The model is available for cloning on GitHub, and can also be tried out through the online demo or Hugging Face Space.
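As a minimal illustration of how multi-speaker generation is typically driven, a dialogue script can be flattened into a single tagged transcript before being passed to the model. The sketch below is hypothetical: the `[SPEAKER0]`-style tag format and the `format_transcript` helper are illustrative assumptions, not the documented Higgs Audio V2 API; consult the GitHub repository for the actual input format.

```python
# Hypothetical sketch: flatten a multi-speaker dialogue into one tagged
# transcript string. The tag convention shown here is an assumption, not
# the confirmed Higgs Audio V2 input format.

def format_transcript(turns: list[tuple[str, str]]) -> str:
    """Map each (speaker_name, text) turn to a numbered speaker tag.

    Speakers are numbered in order of first appearance, so the same
    name always maps to the same [SPEAKERn] tag.
    """
    speaker_ids: dict[str, int] = {}
    lines = []
    for speaker, text in turns:
        if speaker not in speaker_ids:
            speaker_ids[speaker] = len(speaker_ids)
        lines.append(f"[SPEAKER{speaker_ids[speaker]}] {text}")
    return "\n".join(lines)


dialogue = [
    ("Alice", "Did you hear about the new release?"),
    ("Bob", "Yes, the demo sounded remarkably natural."),
    ("Alice", "The emotional range surprised me most."),
]
print(format_transcript(dialogue))
```

A preprocessing step like this keeps the dialogue structure explicit while letting the model condition every turn on a consistent per-speaker identity.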