FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Tongyi SpeechTeam
2024-07-08

Summary
This paper introduces FunAudioLLM, a family of models designed to make voice interaction with large language models (LLMs) more natural. It comprises two main models: SenseVoice for understanding speech and CosyVoice for generating speech.
What's the problem?
The main problem is that while LLMs process text well, they lack equally capable ways to understand and generate spoken language. This includes recognizing speech across many languages, detecting the emotions it carries, and producing natural-sounding voices with controllable styles and tones. Current models often struggle with these tasks, leading to less effective voice interactions.
What's the solution?
To address this, the authors developed two complementary models. SenseVoice focuses on recognizing speech in multiple languages and detecting emotions and audio events, while CosyVoice generates natural-sounding speech in various languages with control over how it sounds. SenseVoice comes in two versions: SenseVoice-Small, optimized for low-latency recognition in five languages, and SenseVoice-Large, which offers higher-accuracy recognition in over 50 languages. By combining these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation and interactive voice chat.
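The speech-to-speech translation application described above can be pictured as a three-stage pipeline: SenseVoice transcribes the source audio, an LLM translates the text, and CosyVoice renders the translation in the original speaker's voice. The sketch below illustrates that flow only; the function names (`transcribe`, `translate`, `synthesize`) are hypothetical stand-ins, not the actual FunAudioLLM APIs.

```python
# Illustrative sketch of a speech-to-speech translation pipeline in the
# style of FunAudioLLM. All three functions are placeholder stubs for the
# real models (SenseVoice, an LLM, CosyVoice).

def transcribe(audio: bytes, language: str = "auto") -> str:
    """Stand-in for SenseVoice ASR: source audio -> source-language text."""
    return "placeholder transcription"

def translate(text: str, target_language: str) -> str:
    """Stand-in for an LLM translation step: text -> target-language text."""
    return f"placeholder translation into {target_language}"

def synthesize(text: str, speaker_prompt: bytes) -> bytes:
    """Stand-in for CosyVoice cross-lingual voice cloning: renders `text`
    in the voice captured by the `speaker_prompt` audio."""
    return b"placeholder-waveform"

def speech_to_speech_translate(audio: bytes, target_language: str) -> bytes:
    source_text = transcribe(audio)
    target_text = translate(source_text, target_language)
    # Reuse the input audio as the speaker prompt, so the translated
    # speech keeps the original speaker's timbre (zero-shot cloning).
    return synthesize(target_text, speaker_prompt=audio)
```

The key design point the paper highlights is that each stage is a separate model, so the same SenseVoice and CosyVoice components can be recombined with an LLM for other applications such as emotional voice chat or audiobook narration.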
Why it matters?
This research is important because it enhances how we can use voice technology in everyday applications. By improving the ability to understand and generate speech naturally, FunAudioLLM can lead to better communication tools, such as more responsive virtual assistants, engaging podcasts, and expressive audiobooks. This advancement pushes the boundaries of how humans and machines interact through voice.
Abstract
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.