SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li
2025-02-21
Summary
This paper introduces SuperGPQA, a new way to test how much AI language models actually know across a wide range of subjects, including many specialized fields that aren't usually tested. It's like creating a super-hard quiz that covers almost every graduate-level field of study to see how well AI can handle tough questions in different areas.
What's the problem?
Current tests for AI language models mostly focus on common subjects like math and science. But human knowledge spans over 200 specialized fields of study that these tests don't cover. This means we don't really know how well AI can handle questions about things like farming, light industry, or service-oriented jobs. It's like only testing a student on basic subjects and never checking whether they can handle more specialized classes.
What's the solution?
The researchers created SuperGPQA, a huge test that covers 285 different graduate-level subjects. They used a system where humans and AI worked together to filter the questions, weeding out any that were too easy or ambiguous (a rough sketch of this filtering loop is shown below). Over 80 experts helped create and check the questions. The researchers then tested some of the best AI models to see how well they did on this new, more comprehensive test.
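To make the filtering idea concrete, here is a minimal sketch of how such a Human-LLM collaborative loop could work. This is an illustration only: the function and method names (filter_question, increase_difficulty, is_ambiguous, clarify) and the thresholds are hypothetical, not details from the paper's actual pipeline.

```python
# Minimal sketch of a Human-LLM collaborative filtering loop.
# All names and thresholds here are hypothetical illustrations,
# not the authors' actual implementation.

def filter_question(question, answer_key, llm_panel, expert, max_rounds=3):
    """Iteratively refine a candidate question until it is neither
    trivial (every LLM solves it) nor ambiguous (an expert flags it)."""
    for _ in range(max_rounds):
        # Step 1: collect answers from a panel of LLMs.
        responses = [llm(question) for llm in llm_panel]
        accuracy = sum(r == answer_key for r in responses) / len(responses)

        # Step 2: if every model answers correctly, the item is trivial;
        # send it back to a human expert to be made harder.
        if accuracy == 1.0:
            question, answer_key = expert.increase_difficulty(question)
            continue

        # Step 3: if the models disagree wildly, an expert checks whether
        # the question itself is ambiguous and rewrites it if so.
        if len(set(responses)) == len(responses) and expert.is_ambiguous(question):
            question, answer_key = expert.clarify(question)
            continue

        # Step 4: neither trivial nor ambiguous; keep the question.
        return question, answer_key

    # Discard questions that never stabilize after repeated revision.
    return None
```

The key design point is that neither side works alone: the LLM panel supplies cheap difficulty and consistency signals at scale, while human experts make the final judgment calls on revision or rejection.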
Why it matters?
This matters because it gives us a much better idea of what AI can really do across many different fields. It showed that even the best AI models still have a lot of room for improvement (the top performer, DeepSeek-R1, answered only 61.82% of the questions correctly), especially in specialized areas. This helps researchers understand what AI is good at and where it needs to get better. It's like getting a report card that shows exactly which subjects an AI needs to study more, which can guide future research and development to make AI smarter and more useful in all areas of knowledge.
Abstract
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
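For readers who want to see how a headline number like 61.82% accuracy is computed, below is a minimal sketch of scoring a model on a multiple-choice benchmark of this kind. The dataset identifier and field names are assumptions based on common Hugging Face conventions, not confirmed details from the paper; check the actual dataset card before use.

```python
# Minimal sketch of scoring a model on a multiple-choice benchmark.
# The dataset id "m-a-p/SuperGPQA" and the field names ("question",
# "options", "answer_letter") are assumptions, not confirmed details.

from datasets import load_dataset

def evaluate(model_answer_fn, split="train"):
    """Compute simple accuracy: the fraction of questions where the
    model's chosen option letter matches the gold answer."""
    data = load_dataset("m-a-p/SuperGPQA", split=split)  # assumed dataset id
    correct = 0
    for item in data:
        # Render the question with lettered options (A., B., C., ...).
        prompt = item["question"] + "\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item["options"])
        )
        prediction = model_answer_fn(prompt)  # expected to return a letter like "A"
        correct += prediction == item["answer_letter"]
    return correct / len(data)
```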