ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems
Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou
2025-10-14
Summary
This paper introduces Acadreason, a challenging new benchmark designed to measure how well large language models (LLMs) and AI agents handle complex, academic-level reasoning.
What's the problem?
Currently, there is no rigorous way to measure how well LLMs and agents can genuinely reason through tough problems that require specialized knowledge. Existing evaluations focus mainly on math and coding contests or on general tasks, while the multi-domain academic benchmarks that do exist lack sufficient reasoning depth. Researchers needed a benchmark that truly tests high-level reasoning skills.
What's the solution?
The researchers created Acadreason, a set of 50 challenging questions drawn from five academic fields: computer science, economics, law, mathematics, and philosophy. The questions come from recent top-tier academic publications and were carefully checked by experts to ensure they are both difficult and solvable. The researchers then evaluated more than ten mainstream LLMs and agents on these questions to see how they performed.
Why it matters?
The results showed that even the most advanced AI models struggled: most LLMs scored below 20 points, and no agent exceeded 40. This highlights a significant gap between what today's AI can do and what truly intelligent academic research demands, and Acadreason provides a valuable tool for measuring progress toward closing it.
Abstract
In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains: computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the gap between the capabilities of current LLMs and agents and the demands of super-intelligent academic research tasks, and highlights the challenges posed by Acadreason.