
Are We Done with MMLU?

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini

2024-06-13


Summary

This paper examines problems in the MMLU benchmark, which is commonly used to evaluate language models. The authors identify numerous ground-truth errors in the dataset and release MMLU-Redux, a manually re-annotated subset of 3,000 questions.

What's the problem?

The MMLU benchmark is widely used to test how well language models understand and reason across a broad range of subjects. However, the authors found that many MMLU questions contain errors; in the Virology subset, 57% of the analysed questions are affected. These mistakes can lead to incorrect evaluations of language models' abilities, making it hard to know how well they actually perform.

What's the solution?

To address this, the authors created MMLU-Redux, a subset of 3,000 MMLU questions spanning 30 subjects that were manually re-annotated by experts. They also introduce an error taxonomy and a framework for identifying and categorizing the types of errors found in the original dataset. Re-evaluating language models on MMLU-Redux shows that their performance metrics shift significantly, indicating that the original dataset's errors have a major impact on reported results.
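
To make the re-evaluation idea concrete, here is a minimal sketch that rescores a model's predictions against corrected labels instead of the original ones. The record fields (prediction, original_answer, corrected_answer) are illustrative placeholders, not the paper's actual data schema.

```python
# Minimal sketch: how rescoring against corrected labels can shift accuracy.
# The record fields (prediction, original_answer, corrected_answer) are
# illustrative placeholders, not the paper's actual schema.

def accuracy(records, label_key):
    """Fraction of predictions matching the answer stored under label_key."""
    correct = sum(1 for r in records if r["prediction"] == r[label_key])
    return correct / len(records)

# Toy records standing in for model predictions on re-annotated questions.
records = [
    {"prediction": "B", "original_answer": "B", "corrected_answer": "B"},
    {"prediction": "C", "original_answer": "A", "corrected_answer": "C"},  # ground-truth error fixed
    {"prediction": "D", "original_answer": "D", "corrected_answer": "A"},  # model matched a wrong label
]

print("Accuracy vs. original labels: ", accuracy(records, "original_answer"))
print("Accuracy vs. corrected labels:", accuracy(records, "corrected_answer"))
```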

Why it matters?

This research matters because it highlights the need for accurate benchmarks when evaluating AI models. By correcting errors in a re-annotated sample of MMLU and releasing MMLU-Redux for further annotation, the authors aim to improve the reliability of language model evaluations. This can help researchers develop and compare language models more dependably.

Abstract

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation at https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.
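
For readers who want to inspect the released data, the sketch below loads MMLU-Redux from the Hugging Face Hub using the `datasets` library. The configuration name "virology" is an assumption based on the MMLU subject names; check the dataset card at the link above for the actual configuration and column names.

```python
# Minimal sketch: loading MMLU-Redux with the Hugging Face `datasets` library.
# The configuration name "virology" is an assumption based on MMLU subject names;
# consult the dataset card for the actual configurations and columns.
from datasets import load_dataset

redux = load_dataset("edinburgh-dawg/mmlu-redux", "virology")

print(redux)  # shows the available splits, column names, and row counts
```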