
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Keisuke Kamahori, Jungo Kasai, Noriyuki Kojima, Baris Kasikci

2025-03-03


Summary

This paper introduces LiteASR, a new method that makes speech recognition models smaller and faster without losing accuracy. It streamlines how these models convert sound to text by using smarter math techniques.

What's the problem?

Modern speech recognition systems, like OpenAI's Whisper, are very accurate but require a lot of computing power to work. This makes them slow and expensive to use, especially for smaller devices or companies with limited resources.

What's the solution?

The researchers created LiteASR, which uses a technique called low-rank approximation to shrink the model's encoder, the part that turns raw audio into an internal representation the decoder then converts to text. They analyzed the encoder's intermediate activations, found that they have strong low-rank structure, and used principal component analysis (PCA) on a small calibration dataset to replace large matrix multiplications with chains of smaller, low-rank ones. They also optimized self-attention so it operates in this reduced dimension. The result shrinks the encoder by over 50% while maintaining or even improving transcription accuracy.
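The core trick, replacing one big matrix multiplication with a chain of two skinny ones chosen via PCA on calibration activations, can be sketched in NumPy. This is an illustrative toy, not the paper's exact procedure: the sizes (`n`, `d_in`, `d_out`, `k`), the synthetic low-rank calibration data, and the uncentered SVD standing in for PCA are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n calibration frames, d_in/d_out layer dims, k kept rank.
n, d_in, d_out, k = 512, 256, 256, 32

# Synthetic calibration activations, built to be low-rank like the paper observes.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d_in))
W = rng.standard_normal((d_in, d_out))  # original dense weight

# PCA-style step: top-k right singular vectors of the calibration activations.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T  # (d_in, k) projection onto the top-k principal directions

# Replace W with a chain of two low-rank matrices: project down, then map out.
A = Vk          # (d_in, k)
B = Vk.T @ W    # (k, d_out)

Y_full = X @ W          # original layer: one big matmul
Y_low = (X @ A) @ B     # compressed layer: two skinny matmuls

rel_err = np.linalg.norm(Y_full - Y_low) / np.linalg.norm(Y_full)
print(rel_err)
```

Because the toy activations are exactly rank `k`, the reconstruction error here is essentially zero; on real activations the error is small but nonzero, and `k` trades accuracy against the compute saved by the two skinny matmuls (`2·d·k` instead of `d·d` multiply-adds per frame).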

Why it matters?

This matters because LiteASR makes speech recognition technology more efficient and accessible. By reducing the size and compute requirements of these models, it becomes easier to run them on smaller devices like phones and in real-time applications. This could lead to better voice assistants, transcription tools, and other technologies that rely on converting speech to text.

Abstract

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
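One way to read "optimize self-attention to work in the reduced dimension" is that, once activations live in a rank-k subspace, the query and key projections can be folded into a single small k-by-k matrix so the attention-score math never touches the full model dimension. The sketch below illustrates that idea only; the variable names, sizes, and the fused-matrix formulation are assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 64, 256, 32   # sequence length, model dim, reduced rank (assumed sizes)

# Low-rank activations, as observed for encoder intermediates in the paper.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
Wq = rng.standard_normal((d, d))  # query projection
Wk = rng.standard_normal((d, d))  # key projection

# Projection onto the activations' top-k subspace (SVD-based "calibration").
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T  # (d, k)

# Baseline: attention scores computed in the full dimension d.
scores_full = (X @ Wq) @ (X @ Wk).T

# Reduced path: fold both projections into one small k-by-k matrix.
Xr = X @ Vk                       # (n, k) reduced activations
M = (Vk.T @ Wq) @ (Wk.T @ Vk)     # (k, k) fused query-key matrix
scores_low = Xr @ M @ Xr.T        # all score math happens at rank k

rel_err = np.linalg.norm(scores_full - scores_low) / np.linalg.norm(scores_full)
print(rel_err)
```

Since the toy activations lie exactly in the span of `Vk`, the reduced-dimension scores match the full ones; the payoff is that each score computation costs O(k) instead of O(d) per token pair.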