
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng

2025-05-06

Summary

This paper introduces LLaMA-Omni2, a family of AI models that can hold real-time voice conversations with people, sounding natural and responding quickly, even though they were trained on far less data than comparable models.

What's the problem?

Most voice chatbots either need huge amounts of training data to sound natural or can't keep pace with a real-time conversation, so they come across as slow or robotic when talking to users.

What's the solution?

The researchers paired a speech encoder with an autoregressive streaming speech decoder, so the chatbot can listen and start speaking its reply while the rest of the answer is still being generated, and they showed it outperforms other advanced models even with much less training data.
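To make the "streaming" part concrete, here is a minimal Python sketch of the general pattern: read a few text tokens from the LLM, autoregressively emit some speech tokens, send the resulting audio chunk out, and repeat. Every name here (llm_text_stream, speech_decoder, vocoder, read_size) is a hypothetical stand-in, and the fixed read schedule is an assumption for illustration; this is not LLaMA-Omni2's actual code.

def llm_text_stream():
    """Stand-in for the LLM's streamed text response."""
    for tok in ["Hello", ",", " how", " can", " I", " help", "?"]:
        yield tok

def speech_decoder(text_tokens, speech_history):
    """Stand-in autoregressive step: map the newest text tokens,
    conditioned on previously emitted speech tokens, to new
    discrete speech tokens (here just a toy hash)."""
    return [hash((t, len(speech_history))) % 1024 for t in text_tokens]

def vocoder(speech_tokens):
    """Stand-in vocoder: pretend each speech token becomes 2 bytes of audio."""
    return b"".join(tok.to_bytes(2, "little") for tok in speech_tokens)

def stream_speech(read_size=3):
    """Interleave reading text tokens with writing speech tokens, so
    audio playback can begin before the full text reply exists."""
    buffer, speech_history = [], []
    for text_tok in llm_text_stream():
        buffer.append(text_tok)
        if len(buffer) >= read_size:        # enough new text: emit speech
            new_speech = speech_decoder(buffer, speech_history)
            speech_history.extend(new_speech)
            yield vocoder(new_speech)       # audio chunk goes out immediately
            buffer.clear()
    if buffer:                              # flush any trailing text
        yield vocoder(speech_decoder(buffer, speech_history))

for i, chunk in enumerate(stream_speech()):
    print(f"audio chunk {i}: {len(chunk)} bytes")

In the real system the decoder and vocoder are trained neural modules rather than toy functions, but this alternating consume-text / emit-audio loop is what lets playback start well before the full reply is written.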

Why it matters?

This matters because it means we can have more natural and faster voice interactions with AI, making virtual assistants, customer service, and language learning tools more helpful and enjoyable for everyone.

Abstract

LLaMA-Omni2, a series of speech language models with parameters ranging from 0.5B to 14B, achieves high-quality real-time speech interaction through a speech encoder and an autoregressive streaming speech decoder, outperforming models like GLM-4-Voice with significantly less training data.