Soundwave: Less is More for Speech-Text Alignment in LLMs

Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li

2025-02-19

Summary

This paper introduces Soundwave, a new AI model that aligns speech and text for tasks like speech translation. It uses far less training data than comparable models while still performing better.

What's the problem?

Current end-to-end speech LLMs need huge amounts of annotated data to work well, which makes training them expensive and time-consuming. They also struggle with two fundamental mismatches between speech and text: the gap between their representation spaces and the inconsistency in their sequence lengths (a few seconds of speech produces far more frames than the corresponding text has tokens).

What's the solution?

The researchers created Soundwave, which uses an efficient training strategy and a novel architecture to address these issues. It employs adapters to align speech representations with the text embedding space, and a shrinking technique to reduce the length of speech sequences toward that of text. With this design, Soundwave outperforms models like Qwen2-Audio while using only one-fiftieth of the training data.
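The paper itself does not spell out the adapter and shrinking modules in this summary, but the two ideas can be illustrated with a minimal numpy sketch. Everything here is a hypothetical simplification: a linear projection stands in for the adapter, and average-pooling of consecutive frames stands in for the shrinking step; the names, dimensions, and pooling factor are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def alignment_adapter(speech_feats, proj):
    """Project speech features into the text embedding space
    (addresses the representation-space gap). `proj` is a learned
    matrix in the real model; here it is just random."""
    return speech_feats @ proj

def shrink_sequence(feats, factor=4):
    """Reduce sequence length by average-pooling groups of `factor`
    consecutive frames (addresses the sequence-length mismatch).
    Any ragged tail shorter than `factor` frames is dropped."""
    n_frames, dim = feats.shape
    n_trim = (n_frames // factor) * factor
    return feats[:n_trim].reshape(n_trim // factor, factor, dim).mean(axis=1)

rng = np.random.default_rng(0)
speech = rng.standard_normal((100, 512))   # 100 speech frames, 512-dim encoder output
proj = rng.standard_normal((512, 768))     # map into a 768-dim "text" space

aligned = alignment_adapter(speech, proj)  # shape (100, 768)
shrunk = shrink_sequence(aligned, factor=4)
print(shrunk.shape)                        # (25, 768): 4x shorter sequence
```

The point of the sketch is the order of operations: first move speech features into the same space the LLM's text embeddings live in, then compress the sequence so its length is closer to a text token sequence before handing it to the LLM.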

Why it matters?

This matters because Soundwave makes speech-to-text AI models more efficient and accessible by reducing the need for massive datasets. It could improve applications like real-time translation and voice assistants, making them faster, cheaper, and more effective for everyday use.

Abstract

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.