Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu
2025-01-27

Summary
This paper introduces a new test for AI called Humanity's Last Exam (HLE). It's designed to be really hard, even for the smartest AI systems, so we can see how close they are to matching human experts across many subjects.
What's the problem?
Current tests for AI language models are becoming too easy. Some AIs are scoring over 90% on popular tests, which means we can't really tell how smart these AIs are getting or how they compare to human experts. It's like giving a high school exam to a college graduate - it doesn't show their true abilities.
What's the solution?
The researchers created HLE, which is like a super hard final exam covering many subjects including math, humanities, and sciences. They asked experts from around the world to make 3,000 tough questions that even the best AIs struggle with. These questions can't be answered by just searching the internet, and they all have clear, correct answers that can be checked automatically. When they tested the best AI systems on HLE, the AIs did poorly, showing there's still a big gap between AI and human experts.
Why does it matter?
This matters because as AI gets smarter, we need good ways to measure its abilities. HLE helps us understand what AI can and can't do compared to human experts. This information is crucial for researchers who are developing AI and for policymakers who need to make decisions about how AI should be used in society. By making HLE available to everyone, the researchers are helping the whole scientific community better understand and prepare for advances in AI technology.
Abstract
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
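Since every HLE question has an unambiguous, easily verifiable answer, scoring a model's responses can be fully automated. The sketch below illustrates one way such exact-match grading could work; the record field names (`question`, `answer`, `answer_type`) and the normalization rules are illustrative assumptions, not the official HLE schema or grading pipeline.

```python
# Minimal sketch of automated exact-match grading for HLE-style questions.
# Field names ("question", "answer", "answer_type") are assumptions for
# illustration, not the official HLE data format.
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace so trivial formatting differences match."""
    text = text.strip().lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    return re.sub(r"\s+", " ", text)      # collapse whitespace


def grade(prediction: str, record: dict) -> bool:
    """Return True if the model's prediction matches the known reference answer."""
    if record.get("answer_type") == "multiple_choice":
        # For multiple choice, compare only the selected option letter (e.g. "B").
        return prediction.strip().upper()[:1] == record["answer"].strip().upper()[:1]
    # Short answer: exact match after normalization.
    return normalize(prediction) == normalize(record["answer"])


# Example usage with a hypothetical record.
record = {"question": "2 + 2 = ?", "answer": "4", "answer_type": "short_answer"}
print(grade("4.", record))  # True
```

In practice, a grader for a benchmark like this would also need to handle model calibration (confidence reporting) and more tolerant answer matching, but the core idea is that each question's single known answer makes checking mechanical rather than judgment-based.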