MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
2024-10-28

Summary
This paper introduces MMAU, a new benchmark designed to evaluate how well audio understanding models can comprehend and reason about different types of audio, including speech, music, and environmental sounds.
What's the problem?
Understanding audio is essential for AI systems to interact effectively with the world. However, current benchmarks for testing these abilities tend to focus on narrow, single-task evaluations, which makes it hard to assess how well models perform complex audio comprehension and reasoning tasks that require expert-level knowledge.
What's the solution?
The authors created MMAU, which pairs 10,000 carefully curated audio clips with human-written questions and answers spanning speech, environmental sounds, and music. The benchmark tests 27 distinct skills, emphasizing advanced perception and reasoning. Using MMAU, the authors evaluated 18 open-source and proprietary audio-language models to measure how well they handle these complex tasks.
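To make the evaluation concrete, below is a minimal sketch of how accuracy is typically computed on a multiple-choice audio question-answering benchmark such as MMAU. The `MMAUItem` structure, `load_mmau` loader, and `model.answer` interface are hypothetical placeholders rather than the paper's released code; the scoring logic (comparing the model's chosen option against the gold answer) simply reflects the standard way such accuracy figures are reported.

```python
from dataclasses import dataclass

@dataclass
class MMAUItem:
    """One benchmark entry: an audio clip, a question,
    answer choices, and the index of the correct choice."""
    audio_path: str
    question: str
    choices: list[str]
    answer_idx: int

def evaluate(model, items: list[MMAUItem]) -> float:
    """Return multiple-choice accuracy of `model` over `items`.

    `model.answer(audio_path, question, choices)` is an assumed
    interface that returns the index of the option the model picks.
    """
    correct = 0
    for item in items:
        pred_idx = model.answer(item.audio_path, item.question, item.choices)
        correct += int(pred_idx == item.answer_idx)
    return correct / len(items)

# Hypothetical usage:
# items = load_mmau(split="test")   # 10k clips across speech, sound, and music
# print(f"accuracy: {evaluate(my_model, items):.2%}")
```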
Why it matters?
This research is important because it sets a new standard for evaluating audio understanding in AI models. By providing a comprehensive benchmark like MMAU, researchers can better understand the strengths and weaknesses of their models, leading to improvements in how AI systems interpret and respond to audio in real-world applications.
Abstract
The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.