
Distilling an End-to-End Voice Assistant Without Instruction Training Data

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang

2024-10-04


Summary

This paper introduces DiVA (Distilled Voice Assistant), a voice assistant that can understand and respond to spoken language without requiring large amounts of labeled instruction data.

What's the problem?

Most voice assistants, like Siri and Google Assistant, process audio and text in separate stages, which can discard information carried in the speech signal and makes the overall system more complicated. Additionally, when end-to-end speech models are trained with supervised finetuning, they tend to forget capabilities inherited from the text-only language models they were built on, which limits what they can do.

What's the solution?

The authors propose a training method for voice assistants that doesn't rely on large amounts of labeled instruction data. Instead, they treat the responses a text-only language model gives to speech transcripts as the training signal, distilling that behavior into the speech model. Because this supervision comes from an existing text model rather than human annotations, the whole process works without any annotated responses. The resulting model, DiVA, performs strongly on tasks like spoken question answering, speech classification, and speech translation while using far less training compute than comparable models.
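To make the training signal concrete, here is a minimal PyTorch sketch of the distillation idea, assuming Hugging Face-style model interfaces. The names distillation_loss, speech_llm, text_llm, and tokenizer are placeholders for illustration, not the authors' actual implementation, and the sketch glosses over how audio and text token positions are aligned.

```python
import torch
import torch.nn.functional as F

def distillation_loss(speech_llm, text_llm, tokenizer, audio, transcript):
    """Pull the speech model's next-token distribution (given audio)
    toward the frozen text-only LLM's distribution (given the transcript)."""
    # Teacher signal: the text-only LLM's response distribution to the
    # transcript. No human-annotated responses are involved.
    text_ids = tokenizer(transcript, return_tensors="pt").input_ids
    with torch.no_grad():
        teacher_logits = text_llm(input_ids=text_ids).logits

    # Student: the speech LLM conditioned directly on the audio.
    # (Assumes its output positions line up with the teacher's.)
    student_logits = speech_llm(audio=audio).logits

    # KL divergence between the two distributions is the training
    # objective: distillation as self-supervision.
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_log_probs,
                    log_target=True, reduction="batchmean")
```

Because the teacher is frozen and supplies full output distributions rather than single labels, the speech model inherits the text model's behavior without any instruction dataset; only paired audio and transcripts are needed.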

Why it matters?

This research matters because it makes capable voice assistants cheaper and easier to build. By removing the need for extensive labeled training data, DiVA's approach can help advance AI systems that understand and respond to human speech in a wide variety of contexts.

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.