Quantization for OpenAI's Whisper Models: A Comparative Analysis
Allison Andreyev
2025-03-14
Summary
This paper investigates ways to make OpenAI's Whisper models, which convert speech to text, more efficient by reducing their size and speeding them up.
What's the problem?
Whisper models are useful for tasks like creating captions and translating speech, but the larger versions can be slow and difficult to use on devices with limited resources. Also, they sometimes create inaccurate or 'hallucinated' content.
What's the solution?
The researchers tested different methods of 'quantization,' which reduce the precision of the numbers a model stores, on three different Whisper models. They measured how this affected the models' speed and accuracy when transcribing speech.
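To illustrate the core idea of quantization, here is a minimal sketch (not code from the paper's repository) of symmetric INT8 quantization: each float weight is scaled into the signed 8-bit range, stored as an integer, and rescaled back at use time. The function names and example values are illustrative assumptions.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric INT8 quantization: map floats into [-127, 127] with one scale."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored integers."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each weight now occupies 1 byte instead of 4, and the
# reconstruction error is bounded by half a quantization step.
```

Storing 8-bit integers instead of 32-bit floats is what shrinks the model roughly 4x at that precision; INT4 and INT5 push the same idea further with smaller integer ranges.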
Why does it matter?
This work matters because it shows how to make speech-to-text models smaller and faster without significantly sacrificing accuracy, making them more practical for use on devices like phones or in real-time applications.
Abstract
Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between three Whisper models, qualitatively examining their distinct capabilities. Next, this study quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open-source LibriSpeech dataset, this paper evaluates the word error rate (WER) along with latency analysis of whisper.cpp using three quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19% and model size by 45%, while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and edge device deployment possibilities. All code, datasets, and implementation details are available in a public GitHub repository: https://github.com/allisonandreyev/WhisperQuantization.git
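Word error rate, the accuracy metric the abstract names, is the word-level edit distance between a reference transcript and a hypothesis, normalized by the reference length. The following is a minimal self-contained sketch of that computation (an illustration of the standard metric, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words gives WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A quantized model "preserves transcription accuracy" in the paper's sense when its WER on the LibriSpeech transcripts stays close to that of the full-precision model.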