MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen
2026-02-19
Summary
This paper introduces a new way to test how well computer models understand audio, called the Massive Audio Embedding Benchmark, or MAEB. It's a big collection of different audio-related challenges, covering things like speech, music, sounds from the environment, and even understanding audio when paired with text, all in over 100 languages.
What's the problem?
Until now, there was no single, comprehensive test of how good audio models *really* are. Existing tests often focus on just one type of audio or one language. The researchers found that models that are great at identifying sounds like a dog barking struggle to understand speech in different languages, and vice versa. This makes it hard to get a clear picture of which models are truly versatile and well-rounded in their audio understanding, and grouping similar audio clips together (clustering) is still a big challenge for all models.
What's the solution?
The researchers created MAEB, which includes 30 different audio tasks drawn from a larger set of 98 tasks (called MAEB+). They then tested over 50 different audio models on these tasks, which allowed them to compare how well each model performed across a wide range of audio challenges. They also found a connection between how well a model does on MAEB and how well it performs when used as part of a larger 'audio large language model', a type of system that is becoming increasingly popular.
Why it matters?
This benchmark is important because it provides a standardized way to evaluate and compare audio models. It helps researchers identify the strengths and weaknesses of different approaches, which ultimately supports the development of better audio understanding technology. It also fits into a larger effort to create unified benchmarks for evaluating models across different types of data – text, images, *and* audio.
Abstract
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks, and is designed to maintain task diversity while reducing evaluation cost; it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
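As a rough illustration of what the MTEB-ecosystem integration looks like in practice, the sketch below runs an audio model over the benchmark with the mteb Python package. The benchmark name "MAEB" and the model identifier are assumptions chosen for illustration; the exact registered names are documented in the linked repository.

```python
# Minimal sketch of evaluating an audio embedding model on MAEB via the
# mteb package. The benchmark name "MAEB" and the model identifier below
# are assumptions; consult the repository for the exact registered names.
import mteb

# Fetch the MAEB task collection (benchmark name assumed).
benchmark = mteb.get_benchmark("MAEB")

# Load an audio embedding model (identifier is illustrative).
model = mteb.get_model("laion/clap-htsat-unfused")

# Run all tasks and write per-task results to disk.
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/maeb")

# Each entry in `results` holds the scores for one task.
for task_result in results:
    print(task_result.task_name)
```

Results are written as per-task files in the output folder, in the same format the MTEB ecosystem uses for its text and image tasks, which is what allows scores from all three modalities to feed a shared leaderboard.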