AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Sidharth Surapaneni, Hoang Nguyen, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Akshay Kalkunte, Sai Rajeswar, Sathwik Tejaswi Madhusudhan

2025-09-12

Summary

This paper introduces a new toolkit, AU-Harness, designed to test and evaluate Large Audio Language Models (LALMs) more thoroughly. LALMs are AI models that process and understand audio, such as speech.

What's the problem?

Currently, evaluating these audio AI models is really difficult because the existing tools are slow, don't use consistent instructions for the models, and don't test a wide enough range of audio understanding skills. This makes it hard to fairly compare different models and figure out what they're actually good at, hindering progress in the field.

What's the solution?

The researchers created AU-Harness, a faster and more thorough evaluation system. It speeds up testing by processing audio in batches and running many evaluations in parallel, which makes large-scale studies practical. It also standardizes how instructions are given to the models, so comparisons between models are fair. Finally, it adds two new categories of tests: one for understanding *when* things happen in audio (speaker diarization) and one for reasoning about spoken language in complex situations.
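To give a feel for how batching and parallel execution speed up evaluation, here is a minimal Python sketch of the general idea. It is only an illustration: the `AudioSample` fields, the `model.generate` interface, and the exact-match scoring are assumptions for this sketch, not AU-Harness's actual API.

```python
# Minimal sketch of batched, parallel evaluation (hypothetical; not AU-Harness's API).
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class AudioSample:
    audio_path: str   # path to an audio clip
    prompt: str       # standardized instruction given to the model
    reference: str    # expected answer used for scoring

def score_batch(model, batch):
    """Send one batch of samples to the model and score each output against its reference."""
    outputs = model.generate([(s.audio_path, s.prompt) for s in batch])  # assumed model interface
    return [out.strip() == s.reference for s, out in zip(batch, outputs)]

def evaluate(model, samples, batch_size=16, workers=4):
    """Split the dataset into batches and score the batches in parallel worker threads."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda b: score_batch(model, b), batches))
    scores = [s for batch_scores in results for s in batch_scores]
    return sum(scores) / len(scores)   # overall accuracy across all samples
```

Batching amortizes per-request overhead, and the worker pool keeps several batches in flight at once, which is the same basic recipe the toolkit uses to make large-scale runs feasible.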

Why it matters?

AU-Harness is important because it provides a reliable way to measure the capabilities of these audio AI models. By identifying where current models struggle, particularly with understanding timing and complex spoken language, it helps researchers improve them. It also promotes standardization in how these models are tested, leading to more meaningful comparisons and faster advancements in the field of audio AI.

Abstract

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations that were previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning tasks. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
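As a concrete, purely hypothetical illustration of what "standardized prompting protocols and flexible configurations" could look like in practice, the sketch below pins one shared instruction per task and fixes the instruction modality. The keys, task names, and prompt wordings are invented for this example and do not reflect AU-Harness's real configuration schema.

```python
# Hypothetical evaluation config illustrating standardized prompts and batch settings.
# All keys and values are invented for illustration; they do not mirror AU-Harness's schema.
EVAL_CONFIG = {
    "batch_size": 16,                 # samples sent to the model per request
    "num_workers": 4,                 # parallel evaluation workers
    "instruction_modality": "text",   # deliver instructions as text rather than spoken audio
    "tasks": {
        "diarization": {
            "prompt": "List each speaker and the time spans in which they speak.",
        },
        "spoken_reasoning": {
            "prompt": "Answer the question asked in the audio and explain your reasoning.",
        },
    },
}
```

Fixing the instruction modality and prompt wording across models is exactly the kind of control the abstract argues for, since the paper reports gaps of up to 9.5 absolute points when that modality is left inconsistent.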