SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang

2026-02-16

Summary

This paper introduces a new benchmark for testing how well spoken query retrieval systems can find relevant text when someone asks a question out loud, even when there is a lot of background noise.

What's the problem?

Currently, the tests used to check these systems are too simple and don't reflect real-world situations, where there is often background noise such as traffic or other people talking. This makes it hard to know whether a system will actually work when someone uses it in a noisy environment, and existing datasets don't show how systems hold up against different kinds and levels of noise.

What's the solution?

The researchers built a large, realistic test set called SQuTR. They collected 37,317 unique questions from six existing English and Chinese text retrieval datasets, then synthesized spoken versions of those questions using voice profiles from 200 real speakers. They mixed in 17 categories of real-world environmental noise at controlled signal-to-noise ratios (SNRs), producing a wide range of noisy speech samples from quiet to extremely noisy conditions (an SNR-controlled mixing step is sketched below). Finally, they evaluated several representative spoken query retrieval systems on this test set to see how well each one holds up as the noise level increases.
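
The paper's exact mixing code isn't shown here, but SNR-controlled noise mixing is typically implemented along the following lines. This is a minimal sketch assuming 1-D float audio arrays at a shared sample rate; the function mix_at_snr and its details are illustrative, not SQuTR's released code.

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Loop or trim the noise clip so it covers the whole speech clip.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Average power of each signal; the epsilon avoids division by zero on silence.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)

    return speech + scaled_noise

For example, mix_at_snr(speech, noise, snr_db=0) makes speech and noise equally loud on average, while snr_db=20 keeps the noise much quieter; sweeping the SNR from high to low values reproduces the quiet-to-extremely-noisy conditions described above.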

Why it matters?

This new test set is important because it provides a more accurate and challenging way to evaluate spoken query retrieval systems. It helps researchers understand how well these systems handle noise and identify where they need to improve, ultimately leading to better voice-activated search and information retrieval in everyday life.

Abstract

Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that includes a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix 17 categories of real-world environmental noise under controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations on representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance decreases as noise increases, with substantially different drops across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query to text retrieval.
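
To make "cascaded" concrete: a cascaded system first transcribes the spoken query with an ASR model and then runs ordinary text retrieval on the transcript, so ASR errors caused by noise propagate into retrieval. The snippet below is a hypothetical illustration of that pipeline using off-the-shelf components (openai-whisper and sentence-transformers); the model names and tiny corpus are stand-ins, not the systems or data evaluated in the paper.

import whisper
from sentence_transformers import SentenceTransformer, util

asr = whisper.load_model("base")                      # ASR stage (illustrative choice)
retriever = SentenceTransformer("all-MiniLM-L6-v2")   # text retrieval stage (illustrative choice)

corpus = ["First candidate passage ...", "Second candidate passage ..."]
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

def retrieve(audio_path: str, top_k: int = 5):
    # Transcribe the (possibly noisy) spoken query, then rank passages by cosine similarity.
    query_text = asr.transcribe(audio_path)["text"]
    query_emb = retriever.encode(query_text, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    top = scores.topk(min(top_k, len(corpus)))
    return [(corpus[int(i)], float(s)) for s, i in zip(top.values, top.indices)]

An end-to-end system would instead embed the audio query directly and skip the transcription step; comparing how the two designs degrade as the SNR drops is the kind of diagnostic analysis the benchmark's unified protocol is meant to support.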