
Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Thanapol Popit, Natthapath Rungseesiripak, Monthol Charattrakool, Saksorn Ruangtanusak

2025-10-07

Summary

This research focuses on figuring out when someone is done talking in a conversation, specifically for Thai speakers, so computers can respond more naturally and quickly.

What's the problem?

Currently, computer systems rely on detecting silence to know when someone has finished speaking. This method isn't very reliable: people often pause mid-sentence, and they frequently signal the end of a turn with specific words or sounds rather than with silence. Waiting for silence therefore causes delays and makes conversations feel unnatural, and the problem hasn't been well studied for the Thai language.

What's the solution?

The researchers tested different ways to teach computers to detect the end of a turn in Thai conversations, using only the text of what's being said. They compared prompting large language models with little or no task-specific training (zero-shot and few-shot) against fine-tuning smaller, lightweight models on labeled data. They used transcripts of Thai conversations and looked for clues in the language itself, like the particles Thai speakers use at the end of sentences, and treated the task as a yes/no decision at each point in the text: has the speaker finished or not? They found a clear tradeoff between how accurate a system was and how quickly it could make that decision.
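To make the idea concrete, here is a deliberately simple illustration (not the paper's actual model): a rule-based baseline that flags a turn as finished when the transcript ends with a common Thai sentence-final particle, the kind of linguistic cue the study highlights. The particle list and function name are hypothetical, chosen for the sketch.

```python
# Illustrative rule-based EOT baseline (hypothetical, not the paper's method):
# decide "end of turn" if the partial transcript ends with a common
# Thai sentence-final particle.

# A few common Thai sentence-final particles (non-exhaustive, illustrative).
FINAL_PARTICLES = ("ครับ", "ค่ะ", "คะ", "นะ", "นะครับ", "นะคะ", "จ้า")

def is_end_of_turn(transcript: str) -> bool:
    """Binary EOT decision over a text prefix: True = speaker likely finished."""
    text = transcript.strip()
    if not text:
        return False
    # Check longer particles first so "นะครับ" isn't shadowed by "นะ".
    for particle in sorted(FINAL_PARTICLES, key=len, reverse=True):
        if text.endswith(particle):
            return True
    return False

print(is_end_of_turn("สวัสดีครับ"))  # polite greeting ending in a particle → True
print(is_end_of_turn("ผมอยากจะ"))    # utterance trailing off mid-sentence → False
```

A real system would replace the lookup with a trained classifier, but the sketch shows the shape of the task: a fast yes/no decision over text alone, with no audio silence involved.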

Why it matters?

This work is important because it provides a starting point for building faster and more responsive AI assistants that can understand and interact with Thai speakers in a more natural way. It shows that you don't necessarily need huge, complex programs to get good results, and that smaller, fine-tuned models can work well enough to run directly on devices like phones or smart speakers.

Abstract

Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.
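The accuracy-latency tradeoff reported in the abstract can be quantified by timing each EOT decision. A minimal measurement harness might look like the sketch below; the stand-in classifier and sample sentences are hypothetical, and any model (prompted LLM, fine-tuned transformer, or rule) could be dropped into its place.

```python
import time

def eot_classifier(text: str) -> bool:
    # Stand-in for any EOT decision function; here, a trivial particle check
    # (illustrative only, not the paper's model).
    return text.strip().endswith(("ครับ", "ค่ะ", "นะคะ"))

def mean_latency_ms(classifier, samples, repeats=100):
    """Average wall-clock time per EOT decision, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for s in samples:
            classifier(s)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(samples)) * 1000.0

samples = ["สวัสดีครับ", "ผมอยากจะ", "ขอบคุณมากค่ะ"]
print(f"mean decision latency: {mean_latency_ms(eot_classifier, samples):.4f} ms")
```

Comparing this number across models, alongside their accuracy on held-out transcripts, is exactly the kind of tradeoff curve the paper reports: compact fine-tuned models sit at the fast end, larger prompted LLMs at the slower, often more accurate end.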