Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices
Evan King, Adam Sabra, Manjunath Kudlur, James Wang, Pete Warden
2025-09-03
Summary
This paper introduces 'Moonshine,' a collection of very small speech recognition models designed for languages that receive little attention in mainstream speech recognition development.
What's the problem?
Usually, people assume the best way to build speech recognition for less common languages is to create one big model that handles many languages at once, hoping that similarities between languages will help. However, this approach breaks down when you're trying to make *really* small models: at that size, a model needs to devote all of its limited capacity to the specifics of a single language to be accurate.
What's the solution?
The researchers created separate, tiny speech recognition models for each language (Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese). Each model is only 27 million parameters in size. They trained these models on a carefully balanced combination of real human-transcribed recordings, computer-generated (synthetic) speech, and speech automatically transcribed by a larger model (pseudo-labeled data). This careful balancing of data sources allowed the small models to perform surprisingly well.
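The paper does not spell out the exact mixing pipeline or ratios, but the idea of training on a balanced mix of data sources can be sketched as weighted sampling across corpora. Everything below — the corpus names, clip identifiers, and mix weights — is hypothetical, purely to illustrate the technique:

```python
import random

# Hypothetical corpora and mixing weights; the actual sources, sizes,
# and ratios used to train the Moonshine models are not given here.
corpora = {
    "human_labeled": ["clip_h1", "clip_h2"],   # real recordings with human transcripts
    "pseudo_labeled": ["clip_p1", "clip_p2"],  # transcripts produced by a larger teacher model
    "synthetic": ["clip_s1", "clip_s2"],       # text-to-speech generated audio
}
weights = {"human_labeled": 0.5, "pseudo_labeled": 0.3, "synthetic": 0.2}


def sample_batch(batch_size: int, rng: random.Random) -> list[str]:
    """Draw each training example from a corpus chosen according to the
    mix weights, so every batch reflects the desired data balance."""
    names = list(corpora)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(corpora[source]))
    return batch
```

Sampling per-example (rather than concatenating the corpora) keeps the intended proportions stable even when the corpora differ greatly in size, which matters when one source (e.g. pseudo-labeled data) is far larger than the human-labeled set.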
Why does it matter?
These 'Moonshine' models are a big step forward because they're small enough to run directly on devices like phones or laptops *without* needing an internet connection. They achieve roughly half the error rate of the similarly sized Whisper Tiny model and even outperform much larger Whisper models, making accurate speech recognition possible for languages that previously had limited support and opening up access to voice technology for more people.
Abstract
We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.
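The error-rate comparisons in the abstract refer to the standard ASR metric, word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of the computation (the function and variable names are illustrative, not from the paper):

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    # One-row dynamic-programming table over the hypothesis words.
    d = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        prev = d[0]  # distance(ref[:i-1], hyp[:j-1]) for the diagonal move
        d[0] = i
        for j, hyp_word in enumerate(hypothesis, start=1):
            cur = d[j]
            d[j] = min(
                d[j] + 1,                        # deletion
                d[j - 1] + 1,                    # insertion
                prev + (ref_word != hyp_word),   # substitution or match
            )
            prev = cur
    return d[len(hypothesis)] / len(reference)
```

For example, `wer("the cat sat".split(), "the bat sat".split())` is one substitution over three reference words, i.e. 1/3; a "48% lower error rate" means this ratio is roughly halved relative to Whisper Tiny.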