VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun
2025-05-07
Summary
This paper introduces VITA-Audio, an end-to-end speech-language model that processes and generates audio and text together with low latency, which suits real-time applications such as voice assistants and live translation.
What's the problem?
Current speech-language models can be slow in streaming settings, and the resulting delay makes them less useful for applications that need quick responses, such as live conversation or instant translation.
What's the solution?
The researchers created a lightweight module called MCTP (Multiple Cross-modal Token Prediction) that lets the model generate several audio tokens at once instead of one at a time, which speeds up decoding and makes speech recognition and generation more efficient.
Why does it matter?
This matters because voice-based technology can become more responsive and effective, improving the experience for users of smart assistants, customer service systems, and accessibility tools.
Abstract
VITA-Audio, an end-to-end speech model, introduces a lightweight MCTP module to generate multiple audio tokens efficiently, reducing latency in streaming applications with enhanced speech processing capabilities.
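To make the latency argument concrete, here is a minimal Python sketch of why emitting several tokens per forward pass cuts the number of expensive model calls. It is a toy illustration, not the authors' implementation: the names `decode`, `toy_step`, and `toy_head`, the interfaces, and the number of extra heads are all hypothetical, and the sketch ignores the (smaller) cost of the lightweight heads themselves.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, chosen only for illustration:
#   step(tokens)         -> (hidden_state, next_token)  one expensive main-model pass
#   head(hidden, tokens) -> next_token                   one cheap extra prediction
Step = Callable[[List[int]], Tuple[int, int]]
Head = Callable[[int, List[int]], int]


def decode(step: Step, extra_heads: List[Head], num_tokens: int) -> Tuple[List[int], int]:
    """Generate num_tokens tokens and count the expensive main-model passes used."""
    tokens: List[int] = []
    passes = 0
    while len(tokens) < num_tokens:
        hidden, token = step(tokens)  # one main-model forward pass
        passes += 1
        tokens.append(token)
        # Each lightweight head adds one more token from the same pass.
        for head in extra_heads:
            if len(tokens) >= num_tokens:
                break
            tokens.append(head(hidden, tokens))
    return tokens, passes


def toy_step(tokens: List[int]) -> Tuple[int, int]:
    # Stand-in for the main model: the "token" is just the current position.
    return len(tokens), len(tokens)


def toy_head(hidden: int, tokens: List[int]) -> int:
    # Stand-in for a lightweight prediction head.
    return len(tokens)


baseline_tokens, baseline_passes = decode(toy_step, [], 10)
multi_tokens, multi_passes = decode(toy_step, [toy_head] * 4, 10)
```

In this toy setup both runs produce the same ten tokens, but the baseline needs ten main-model passes while the run with four extra heads needs two. Real speedups depend on the cost and accuracy of the extra predictions.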