EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
2025-06-02
Summary
This paper introduces EmergentTTS-Eval, a benchmark for testing how well text-to-speech (TTS) systems handle tricky and expressive speech, using AI models to automatically create the test cases and judge the results.
What's the problem?
Current TTS systems often struggle to produce natural-sounding speech when the text is complex, emotional, or requires special emphasis, and existing benchmarks make it hard to measure how well a system handles these challenges.
What's the solution?
The researchers built EmergentTTS-Eval, which uses a large language model (LLM) to automatically generate challenging test sentences and then uses an audio-capable model (LALM) to listen to the synthesized speech and judge how well the TTS system performed. This setup allows for a more detailed and fair evaluation of TTS systems on qualities like expressiveness, prosody, and complex language. A rough sketch of what such a loop could look like is shown below.
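To make the setup more concrete, here is a minimal Python sketch of a generate-synthesize-judge loop of this kind. The function names (`generate_test_case`, `synthesize`, `judge_speech`), the categories, and the scoring are illustrative placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    category: str
    text: str

def generate_test_case(category: str, difficulty: int) -> TestCase:
    # Placeholder: in the benchmark, an LLM would produce a challenging
    # sentence for the given category and difficulty level.
    return TestCase(category, f"[{category} sentence at difficulty {difficulty}]")

def synthesize(tts_model, text: str) -> bytes:
    # Placeholder: the TTS system under test renders the text to audio.
    return tts_model(text)

def judge_speech(audio: bytes, text: str, category: str) -> float:
    # Placeholder: an audio-capable judge model (LALM) scores how well the
    # audio realizes the intended prosody/expressiveness for this category.
    return 0.0

def evaluate(tts_model, categories, difficulties=range(1, 4)):
    """Average the judge's score per category across difficulty levels."""
    scores = {}
    for category in categories:
        per_case = []
        for level in difficulties:
            case = generate_test_case(category, level)
            audio = synthesize(tts_model, case.text)
            per_case.append(judge_speech(audio, case.text, case.category))
        scores[category] = sum(per_case) / len(per_case)
    return scores

# Example usage with a dummy "TTS model" that returns empty audio.
if __name__ == "__main__":
    print(evaluate(lambda text: b"", ["expressive emphasis", "complex syntax"]))
```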
Why does it matter?
This is important because it helps improve TTS technology, making computer-generated voices sound more human and expressive, which is useful for virtual assistants, audiobooks, accessibility tools, and more.
Abstract
EmergentTTS-Eval is a comprehensive TTS benchmark that automates test-case generation and evaluation using LLMs and a LALM to assess how well TTS systems render nuanced and semantically complex text in speech.