Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Maike Züfle, Sara Papi, Fabian Retkowski, Szymon Mazurek, Marek Kasztelnik, Alexander Waibel, Luisa Bentivogli, Jan Niehues
2026-03-11
Summary
This paper focuses on how well speech-based large language models, or SLLMs, actually perform when given instructions through spoken language, as opposed to just typed text.
What's the problem?
Currently, SLLMs are mostly tested using written prompts. This isn't realistic because people often *speak* to these models, especially through voice assistants. Testing with text alone doesn't show how well the models understand spoken instructions, particularly across different languages and speaking styles, so it doesn't reflect real-world use.
What's the solution?
The researchers created a new dataset called DoWhatISay (DOWIS). It pairs the same instructions as spoken audio and written text across nine tasks and eleven languages, with ten prompt variants per task-language pair, recorded by human speakers in five different styles to keep it realistic. They then benchmarked several existing SLLMs on this dataset to see how performance differs between spoken and written prompts.
Why does it matter?
The findings show that SLLMs generally perform better when given written instructions compared to spoken ones, especially when dealing with less common languages or when translating between languages. However, spoken prompts work just as well when the task involves generating speech as output. This highlights the need to evaluate SLLMs using spoken language to get a true picture of their capabilities and to improve their understanding of voice commands.
Abstract
Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact through speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
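To make the paired-prompt setup concrete, here is a minimal sketch of how the spoken/written pairing could be organized and used to compare prompt modalities. This is not the authors' release format or evaluation code; all field names, function names, and labels below are hypothetical illustrations of the structure described in the abstract (tasks, languages, styles, and paired text/audio forms of the same instruction).

```python
# Hypothetical sketch: pairing written and spoken versions of the same
# instruction so that any score gap can be attributed to prompt modality.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptVariant:
    task: str        # one of the 9 tasks (hypothetical label, e.g. "summarization")
    language: str    # one of the 11 languages (e.g. "de")
    style: str       # one of the 5 speaking styles
    text: str        # written form of the instruction
    audio_path: str  # path to the human-recorded spoken form


def compare_modalities(
    variants: list[PromptVariant],
    run_text: Callable[[str], str],    # model called with the written prompt
    run_audio: Callable[[str], str],   # model called with the spoken prompt (audio file)
    score: Callable[[str], float],     # task-specific metric applied to the model output
) -> dict[str, tuple[float, float]]:
    """Return average (text_score, audio_score) per task/language pair."""
    totals: dict[str, list[float]] = {}
    for v in variants:
        key = f"{v.task}/{v.language}"
        acc = totals.setdefault(key, [0.0, 0.0, 0.0])
        acc[0] += score(run_text(v.text))
        acc[1] += score(run_audio(v.audio_path))
        acc[2] += 1.0
    return {k: (a[0] / a[2], a[1] / a[2]) for k, a in totals.items()}
```

Because both forms carry the same instruction content, averaging over the ten variants and five styles per task-language pair isolates the effect of delivering the prompt as speech rather than text, which is the comparison the paper reports.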