Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS

Mahta Fetrat, Donya Navabi, Zahra Dehghanian, Morteza Abolghasemi, Hamid R. Rabiee

2025-12-11


Summary

This paper shows how lightweight text-to-speech (TTS) systems can pronounce text more accurately without giving up real-time speed, which matters most for accessibility tools.

What's the problem?

Current text-to-speech systems face a trade-off. Fast systems convert words into sounds with simple methods that ignore context, which can lead to mispronunciations, while more accurate, context-aware methods are too slow for real-time use.

What's the solution?

The researchers created a system that separates the complex part of understanding language (phonemization, or turning text into sounds) from the actual speech generation. They made the phonemization process more intelligent so it considers the context of words, but they run it as a separate 'service' that doesn't slow down the core speech engine. This allows for better pronunciation without sacrificing speed.
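To make the idea of running phonemization as a separate service more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy lexicon, the `PhonemizerService` class, the `phonemize_with_budget` helper, and the timeout-based fallback are all assumptions made for illustration, and the English heteronym "read" is only a stand-in for the kind of context-dependent case the paper targets.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical toy lexicon; a real system would use a trained G2P model.
LEXICON = {
    "i": "AY",
    "read": "R IY D",
    "the": "DH AH",
    "book": "B UH K",
    "yesterday": "Y EH S T ER D EY",
}


def fast_phonemize(text):
    """Context-free lookup: one fixed pronunciation per word."""
    return [LEXICON.get(word, "<unk>") for word in text.lower().split()]


def context_aware_phonemize(text):
    """Stand-in for a heavier model: picks the past-tense
    pronunciation of 'read' when 'yesterday' is in the sentence."""
    words = text.lower().split()
    phones = fast_phonemize(text)
    if "yesterday" in words:
        phones = ["R EH D" if w == "read" else p for w, p in zip(words, phones)]
    return phones


class PhonemizerService:
    """Runs a phonemizer in its own worker so the caller never
    executes the heavy model on its own thread (assumed design)."""

    def __init__(self, phonemize_fn, workers=1):
        self._fn = phonemize_fn
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, text):
        # Returns a Future; the caller decides how long to wait.
        return self._pool.submit(self._fn, text)

    def close(self):
        self._pool.shutdown(wait=True)


def phonemize_with_budget(service, text, budget_s=0.05):
    """Ask the service for phonemes, but fall back to the fast
    phonemizer if no answer arrives within the latency budget."""
    future = service.submit(text)
    try:
        return future.result(timeout=budget_s), "context-aware"
    except FutureTimeout:
        return fast_phonemize(text), "fallback"
```

In this sketch the core engine only ever calls `phonemize_with_budget`, so the context-aware model can be swapped out or moved to another process without touching the synthesis code. The latency budget and the fallback policy are design choices of the sketch, not details taken from the paper.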

Why does it matter?

This work lets real-time applications, such as those running on phones or assistive devices, use accurate, high-quality text-to-speech, including offline and on-device. That makes these technologies more useful and more accessible.

Abstract

Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance. This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.
