
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Guoli Yin, Haoping Bai, Shuang Ma, Feng Nan, Yanchao Sun, Zhaoyang Xu, Shen Ma, Jiarui Lu, Xiang Kong, Aonan Zhang, Dian Ang Yap, Yizhe Zhang, Karsten Ahnert, Vik Kamath, Mathias Berglund, Dominic Walsh, Tobias Gindele, Juergen Wiest, Zhengfeng Lai, Xiaoming Wang, Jiulong Shan, Meng Cao

2024-07-30


Summary

This paper introduces MMAU, a new benchmark designed to evaluate the capabilities of large language models (LLMs) as human-like agents across various tasks. It aims to provide a more detailed understanding of their strengths and weaknesses.

What's the problem?

Current benchmarks for evaluating LLMs usually focus on whether a model completes a specific task, such as using a tool or answering a question, but they don't break down the underlying skills needed to succeed. This makes it hard to tell why a model fails or succeeds. In addition, setting up these evaluation environments can take considerable effort, and the results are sometimes unreliable or hard to reproduce, especially for tasks that require interaction.

What's the solution?

MMAU addresses these issues with a comprehensive suite of offline tasks that require no complex environment setup. It evaluates models across five domains: tool use, directed acyclic graph (DAG) question answering, data science and machine learning coding, contest-level programming, and mathematics. Across these domains, it measures five core capabilities: understanding, reasoning, planning, problem-solving, and self-correction. With 20 carefully designed tasks comprising over 3,000 distinct prompts, MMAU provides a thorough framework for assessing LLM agents, and the researchers evaluated 18 representative models on it to analyze their performance.
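The benchmark's core structure is a grid of domains crossed with capabilities. As a rough illustration of how per-capability scores could be aggregated from graded prompt results, here is a minimal, hypothetical Python sketch; it is not the official MMAU evaluation code, and the record fields (domain, capability, correct) and the function aggregate_scores are assumptions made for illustration.

```python
# Hypothetical sketch only: aggregate accuracy per (domain, capability) pair
# from a list of graded prompt results. The official MMAU evaluation scripts
# are released separately (see the repository linked in the abstract).
from collections import defaultdict

def aggregate_scores(results):
    """Return accuracy keyed by (domain, capability).

    `results` is a list of dicts like:
        {"domain": "tool_use", "capability": "planning", "correct": True}
    """
    totals = defaultdict(int)   # prompts seen per (domain, capability)
    hits = defaultdict(int)     # prompts answered correctly
    for r in results:
        key = (r["domain"], r["capability"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}

if __name__ == "__main__":
    toy_results = [
        {"domain": "tool_use", "capability": "planning", "correct": True},
        {"domain": "tool_use", "capability": "planning", "correct": False},
        {"domain": "math", "capability": "reasoning", "correct": True},
    ]
    for (domain, capability), acc in sorted(aggregate_scores(toy_results).items()):
        print(f"{domain:10s} {capability:15s} accuracy={acc:.2f}")
```

Reporting scores along both axes, rather than a single task-completion rate, is what lets the benchmark attribute a failure to, say, planning rather than tool use.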

Why it matters?

This research is important because it helps improve our understanding of how well AI models can perform various tasks and where they might struggle. By providing a more detailed evaluation method, MMAU can guide future developments in AI technology, ensuring that these models become more capable and reliable in real-world applications.

Abstract

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/docs/research/mmau.