Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass
2025-10-06
Summary
This paper investigates how well current speech-based AI models can handle the timing aspects of natural conversations, like keeping a consistent pace or talking at the same time as another person.
What's the problem?
Existing conversational AI models are good at understanding *what* is said, but they struggle with *when* things are said. Specifically, they aren't very good at managing the flow of a conversation, including things like responding at the right speed, following instructions that involve timing, or handling situations where multiple people are speaking simultaneously. There wasn't a good way to systematically test these abilities before.
What's the solution?
The researchers created a new testing framework called 'Game-Time Benchmark'. This framework includes a series of tasks, starting with simple instructions and moving to more complex scenarios that require the AI to pay attention to timing and respond in sync with a 'partner'. They then tested several different AI models using this benchmark to see how they performed.
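To make the idea of a timing-aware metric concrete, here is a minimal, hypothetical sketch of how a "tempo adherence" check might be scored. This is an illustration only, not the authors' actual evaluation code: the `Word` structure, the `tempo_deviation` function, and the words-per-second target are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """One recognized word with its time alignment (hypothetical format)."""
    text: str
    start: float  # seconds
    end: float    # seconds

def tempo_deviation(words: list[Word], target_wps: float) -> float:
    """Relative deviation of the observed speaking rate (words per second)
    from a target tempo; 0.0 means the response matched the tempo exactly."""
    if not words:
        return 1.0  # treat an empty response as maximally off-target
    duration = words[-1].end - words[0].start
    if duration <= 0:
        return 1.0
    observed_wps = len(words) / duration
    return abs(observed_wps - target_wps) / target_wps

# Example: six words spoken over ~2.9 seconds, scored against a
# target tempo of 2.5 words per second.
resp = [Word(f"w{i}", start=i * 0.5, end=i * 0.5 + 0.4) for i in range(6)]
score = tempo_deviation(resp, target_wps=2.5)
```

A benchmark task could then threshold such a score (e.g., count a response as passing if its deviation stays below some tolerance), which is one plausible way the "tempo adherence" tasks described above could be made quantitative.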
Why does it matter?
This work is important because it highlights a significant weakness in current conversational AI. By creating a standardized way to measure these timing-related skills, the researchers hope to encourage the development of AI systems that can have more natural and fluid conversations with humans, making interactions feel less robotic and more realistic.
Abstract
Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity for temporal dynamics, including the ability to manage timing, tempo, and simultaneous speaking, remains a critical and under-evaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction-following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI. Demos and datasets are available on our project website: https://ga642381.github.io/Game-Time.