LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng

2024-09-11

Summary

This paper introduces LLaMA-Omni, a new model that lets users interact with large language models (LLMs) through speech rather than text alone, making the exchange smoother and more efficient.

What's the problem?

While models like GPT-4o have enabled real-time speech interaction with LLMs, there has been little work on building similar systems from open-source models. This limits the accessibility and usability of speech-based interaction for users who want to communicate with AI by voice.

What's the solution?

To solve this issue, the authors developed LLaMA-Omni, which combines several components: a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. This setup lets the model understand spoken instructions and generate responses in both text and speech without first transcribing the speech into text. The authors also built a dataset called InstructS2S-200K, containing 200,000 speech instructions with corresponding speech responses, to train the model for speech interaction. Experimental results showed that LLaMA-Omni outperformed previous speech-language models in both the content and style of its responses, with response latency as low as 226 ms.
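To make that component layout concrete, here is a minimal PyTorch sketch of how the four pieces could be wired together: speech features go through an encoder, an adaptor compresses and projects them into the LLM's embedding space, the LLM produces hidden states, and two heads read off text tokens and discrete speech units in parallel. The module sizes, the GRU/Transformer stand-ins, and the unit head are illustrative assumptions only; the paper's actual system builds on Llama-3.1-8B-Instruct with its own pretrained encoder and streaming speech decoder, not these toy modules.

```python
# Hypothetical sketch of a LLaMA-Omni-style pipeline (illustrative only).
# A real system would plug in a pretrained speech encoder and
# Llama-3.1-8B-Instruct; the tiny modules below are stand-ins so the
# wiring can be run end to end.
import torch
import torch.nn as nn


class SpeechAdaptor(nn.Module):
    """Downsamples speech-encoder features and projects them into the LLM embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int, stride: int = 5):
        super().__init__()
        self.stride = stride
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stride, llm_dim),
            nn.ReLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, t, d = feats.shape
        t = t - t % self.stride                       # drop the ragged tail
        feats = feats[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(feats)                       # (b, t/stride, llm_dim)


class LlamaOmniSketch(nn.Module):
    """Speech in -> text-token logits plus discrete speech-unit logits (toy stand-ins)."""

    def __init__(self, enc_dim=512, llm_dim=1024, vocab=32000, units=1000):
        super().__init__()
        self.speech_encoder = nn.GRU(80, enc_dim, batch_first=True)    # placeholder for a pretrained encoder
        self.adaptor = SpeechAdaptor(enc_dim, llm_dim)
        self.llm = nn.TransformerEncoder(                              # placeholder for Llama-3.1-8B-Instruct
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2
        )
        self.text_head = nn.Linear(llm_dim, vocab)                     # text-response logits
        self.speech_decoder = nn.TransformerEncoder(                   # placeholder streaming speech decoder
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2
        )
        self.unit_head = nn.Linear(llm_dim, units)                     # discrete speech-unit logits

    def forward(self, mel: torch.Tensor):
        enc, _ = self.speech_encoder(mel)             # (b, t, enc_dim) acoustic features
        llm_in = self.adaptor(enc)                    # compressed speech embeddings for the LLM
        hidden = self.llm(llm_in)                     # LLM hidden states
        text_logits = self.text_head(hidden)          # text response read from these logits
        unit_logits = self.unit_head(self.speech_decoder(hidden))  # speech units, later vocoded to audio
        return text_logits, unit_logits


if __name__ == "__main__":
    model = LlamaOmniSketch()
    mel = torch.randn(1, 300, 80)                     # ~3 s of 80-dim mel features (fake input)
    text_logits, unit_logits = model(mel)
    print(text_logits.shape, unit_logits.shape)
```

The key design point this sketch tries to convey is that the text head and the speech decoder both read from the same LLM hidden states, which is what lets the model emit text and speech simultaneously instead of running a separate text-to-speech pass afterward.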

Why it matters?

This research is important because it enhances how people can interact with AI systems using their voices. By making speech interaction more efficient and effective, LLaMA-Omni could improve applications in areas like virtual assistants, customer service, and accessibility for users who prefer or need to use voice commands.

Abstract

Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.