Mind the Gap! Static and Interactive Evaluations of Large Audio Models

Minzhi Li, William Barr Held, Michael J Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang

2025-02-25

Summary

This paper introduces a new way to test AI models that can understand and respond to spoken language, comparing traditional testing methods with real-world interactions.

What's the problem?

As voice-controlled AI assistants become more common, we need better ways to make sure they're actually helpful to users. Current tests for these Large Audio Models (LAMs) don't always show how well they'll work in real conversations.

What's the solution?

The researchers created an interactive test in which 484 real people had 7,500 conversations with different LAMs. They looked at what kinds of things people wanted to use voice AI for, which models people liked best, and how well the old-style tests predicted how people would feel about the AI in real conversations.
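
To make the preference part of this concrete, here is a minimal, purely illustrative sketch (not the paper's released code) of how pairwise user votes between models could be aggregated into a simple win-rate ranking; the model names and votes are invented placeholders:

```python
# Illustrative sketch: aggregating pairwise user preferences into a ranking.
# All model names and vote outcomes below are hypothetical.
from collections import defaultdict

# Each tuple records (winner, loser) from one user comparison.
votes = [("model_a", "model_b"), ("model_a", "model_c"),
         ("model_c", "model_b"), ("model_a", "model_b")]

wins = defaultdict(int)
games = defaultdict(int)
for winner, loser in votes:
    wins[winner] += 1
    games[winner] += 1
    games[loser] += 1

# Rank models by the fraction of comparisons they won.
ranking = sorted(games, key=lambda m: wins[m] / games[m], reverse=True)
print(ranking)  # ['model_a', 'model_c', 'model_b']
```

Real studies typically use more robust aggregation (e.g., Bradley-Terry-style strength estimates), but the idea of turning head-to-head votes into a model ordering is the same.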

Why it matters?

This matters because it shows that the current ways of testing voice AI might not be good enough. The study found that doing well on standard tests doesn't necessarily mean an AI will be good at talking to real people. This could help companies make voice AI that people actually enjoy using, rather than just ones that score well on technical tests.

Abstract

As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication for both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how static benchmarks predict interactive performance: our analysis reveals no individual benchmark strongly correlates with interactive results (τ ≤ 0.33 for all benchmarks). While combining multiple coarse-grained features yields modest predictive power (R² = 0.30), only two out of twenty datasets on spoken question answering and age prediction show significantly positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
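
As a rough illustration of the kind of correlation analysis the abstract describes, the sketch below computes Kendall's τ between each static benchmark's scores and an interactive preference measure, then fits a linear regression over the combined features to get an R². All numbers, model rows, and benchmark columns are invented placeholders, not the paper's data:

```python
# Hypothetical sketch of benchmark-vs-preference correlation analysis.
# Data is made up for illustration; the paper reports tau <= 0.33 per
# benchmark and R^2 = 0.30 when combining coarse-grained features.
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression

# One row per LAM; columns are scores on three static benchmarks.
static_scores = np.array([
    [0.71, 0.55, 0.62],   # model A
    [0.65, 0.60, 0.58],   # model B
    [0.80, 0.48, 0.70],   # model C
    [0.60, 0.66, 0.52],   # model D
])
# Interactive preference per model, e.g. derived from user rankings.
interactive_pref = np.array([0.30, 0.22, 0.28, 0.20])

# Rank correlation between each individual benchmark and preference.
for j in range(static_scores.shape[1]):
    tau, p = kendalltau(static_scores[:, j], interactive_pref)
    print(f"benchmark {j}: tau={tau:.2f}, p={p:.2f}")

# Combining all benchmarks as regression features; .score returns R^2.
reg = LinearRegression().fit(static_scores, interactive_pref)
print("combined R^2 =", reg.score(static_scores, interactive_pref))
```

A weak per-benchmark τ alongside a modest combined R², as in the paper's findings, is exactly the pattern that motivates building evaluations that track user preference more directly.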