On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation
Jeff Chan-Jan Sju, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso
2026-01-13
Summary
This paper is about how we measure the quality of computer programs that generate speech, like those that can continue a conversation or read text aloud.
What's the problem?
Currently, people judge speech generation using a method borrowed from evaluating text-generating programs. This method, called 'token perplexity', doesn't account for the unique qualities of speech – things like how it *sounds* and the subtle cues that make it natural. Because of this, it can give a misleading idea of how good these speech programs actually are, potentially making them seem further from human-level than they really are.
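To make the borrowed metric concrete, here is a minimal sketch of how global token perplexity is typically computed: pool the model's log-probability for every token in the evaluation set, average, and exponentiate. The function name and the example log-probabilities are hypothetical, not from the paper.

```python
import math

def global_token_perplexity(token_logprobs):
    """Perplexity = exp(negative mean log-probability), pooled
    over all tokens in the evaluation set ("global" pooling)."""
    n = len(token_logprobs)
    avg_nll = -sum(token_logprobs) / n  # average negative log-likelihood
    return math.exp(avg_nll)

# Hypothetical per-token log-probs for a short speech-token sequence.
logprobs = [-1.2, -0.7, -2.3, -0.9]
print(global_token_perplexity(logprobs))
```

Lower perplexity means the model assigns higher probability to the observed tokens; the paper's point is that this single pooled number ignores modality-specific properties of speech tokens.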
What's the solution?
The researchers came up with new ways to evaluate speech generation that are specifically designed for audio. Instead of just looking at how well the program predicts the next 'token' (a small piece of sound), they developed methods that focus on how realistic and high-quality the generated speech actually sounds. They then compared the results from these new methods to how people actually rated the speech quality.
Why does it matter?
These new evaluation methods are important because they give us a more accurate picture of how well speech generation programs are performing. When using the new methods, the gap between the best programs and human-level speech narrowed, suggesting that we were previously underestimating how much progress was being made. This means better evaluation is crucial for continuing to improve these technologies and building truly realistic and helpful speech-based AI.
Abstract
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.