From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins
2025-03-17
Summary
This paper explores adding speech capabilities to TOWER, a text-only multilingual model, by treating speech as just another language for the model to learn.
What's the problem?
AI models are increasingly good at understanding and generating text, but it's challenging to integrate other forms of communication, like speech, into these models, especially when dealing with multiple languages.
What's the solution?
The researchers converted speech into discrete units, a text-like token format, and then continued training TOWER to process this speech data as if it were an additional language. The resulting model, SPIRE, can transcribe and translate spoken English while still performing its original text-based tasks.
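The core idea of speech discretization can be illustrated with a toy sketch. Real pipelines typically extract frame-level features with a self-supervised speech encoder and cluster them with k-means; the cluster indices then become pseudo-text tokens the LLM can consume. The encoder, cluster count, and token naming below are illustrative assumptions, not the paper's exact setup, and random vectors stand in for acoustic features.

```python
import numpy as np

rng = np.random.default_rng(0)

def discretize(features, centroids):
    """Assign each feature frame to its nearest centroid (a discrete unit)."""
    # Pairwise distances: shape (n_frames, n_clusters)
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

def deduplicate(units):
    """Collapse consecutive repeated units, a common step before LLM input."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def to_tokens(units):
    """Render discrete units as pseudo-text tokens (hypothetical naming)."""
    return " ".join(f"<su_{u}>" for u in units)

centroids = rng.normal(size=(8, 16))   # 8 clusters over 16-dim features
features = rng.normal(size=(20, 16))   # 20 frames standing in for speech

units = discretize(features, centroids)
tokens = to_tokens(deduplicate(units.tolist()))
print(tokens)
```

Once speech is in this token form, "learning the speech modality" reduces to the same continued pre-training recipe used for adding a translation language: the unit tokens are added to the vocabulary and the model is trained on paired unit/text data.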
Why does it matter?
This work matters because it shows a relatively simple way to add speech recognition and translation capabilities to existing AI models, potentially leading to more versatile and user-friendly AI systems.
Abstract
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.