LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
2024-09-17

Summary
This paper discusses how large language models (LLMs) can be used to convert written words into their spoken sounds (grapheme-to-phoneme or G2P conversion) and introduces new methods to improve this process.
What's the problem?
Grapheme-to-phoneme conversion is important for speech-related applications like text-to-speech systems. However, traditional G2P systems struggle to use the context of words, especially in languages such as Persian, where the script omits short vowels and a single written form can be pronounced differently depending on its usage; for example, the string مرد can be read as mard ('man') or mord ('died'). This lack of contextual understanding can lead to incorrect pronunciations.
What's the solution?
The researchers created a new sentence-level benchmarking dataset to evaluate how well LLMs handle the phonetic challenges of Persian in G2P tasks. They also developed prompting and post-processing techniques that enhance the outputs of LLMs without needing extra training or labeled data. Their experiments showed that these methods allow LLMs to outperform traditional G2P tools, even for Persian, a language that is often underrepresented in research.
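The general idea of prompting plus post-processing can be sketched as follows: build a few-shot prompt from example word-phoneme pairs, query the model, then clean its free-form reply down to a valid phoneme string. This is a minimal illustration, not the paper's exact method; the romanized phoneme inventory, prompt wording, and `mock_llm` stand-in are all assumptions for the sketch.

```python
# Hypothetical romanized phoneme inventory for Persian (assumption).
PHONEMES = {"a", "A", "e", "i", "o", "u", "b", "p", "t", "d", "k", "g",
            "m", "n", "r", "l", "s", "z", "S", "f", "v", "h", "x", "j", "q"}

def build_prompt(word, examples):
    """Few-shot prompt: (word, phonemes) pairs followed by the query word."""
    lines = [f"Word: {w} -> Phonemes: {p}" for w, p in examples]
    lines.append(f"Word: {word} -> Phonemes:")
    return "\n".join(lines)

def post_process(raw_reply):
    """Keep the first token of the reply and drop any symbols outside the
    inventory, so chatty LLM output maps onto a clean phoneme string."""
    stripped = raw_reply.strip()
    candidate = stripped.split()[0] if stripped else ""
    return "".join(ch for ch in candidate if ch in PHONEMES)

def mock_llm(prompt):
    # Stand-in for a real LLM call (assumption); real replies are often
    # wrapped in extra commentary, which post_process removes.
    return " mard! (meaning 'man')"

prompt = build_prompt("mrd", [("ketAb", "ketAb"), ("dust", "dust")])
print(post_process(mock_llm(prompt)))  # -> mard
```

In this setup the post-processing step is what makes unconstrained LLM output usable: the model is free to answer in prose, but only characters from the declared phoneme inventory survive.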
Why it matters?
This research is significant because it demonstrates the potential of using advanced AI models to improve how machines understand and generate speech. By enhancing the accuracy of G2P conversion, this work can lead to better text-to-speech systems and voice recognition technologies, making them more effective in various applications like virtual assistants and language learning tools.
Abstract
Grapheme-to-phoneme (G2P) conversion is critical in speech processing, particularly for applications like speech synthesis. G2P systems must possess linguistic understanding and contextual awareness of languages with polyphone words and context-dependent phonemes. Large language models (LLMs) have recently demonstrated significant potential in various language tasks, suggesting that their phonetic knowledge could be leveraged for G2P. In this paper, we evaluate the performance of LLMs in G2P conversion and introduce prompting and post-processing methods that enhance LLM outputs without additional training or labeled data. We also present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language. Our results show that by applying the proposed methods, LLMs can outperform traditional G2P tools, even in an underrepresented language like Persian, highlighting the potential of developing LLM-aided G2P systems.