ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
Hongwei Liu, Junnan Liu, Shudong Liu, Haodong Duan, Yuqiang Li, Mao Su, Xiaohong Liu, Guangtao Zhai, Xinyu Fang, Qianhong Ma, Taolin Zhang, Zihan Ma, Yufeng Zhao, Peiheng Zhou, Linchen Xiao, Wenlong Zhang, Shijie Zhou, Xingjian Ma, Siqi Sun, Jiaye Ge, Meng Li, Yuhong Liu
2025-11-19
Summary
This paper introduces ATLAS, a new and challenging benchmark designed to truly test the scientific reasoning abilities of Large Language Models (LLMs). Current benchmarks are becoming too easy for advanced models and often don't reflect real-world scientific problems.
What's the problem?
Existing ways to measure how good LLMs are at science are falling short. Many models now score near-perfectly on older tests, making it hard to tell which ones are actually improving. These tests also tend to focus on a single area of science, ask overly simple questions, or have been accidentally 'spoiled' because their questions leaked into the models' training data, so high scores may reflect memorization rather than genuine understanding.
What's the solution?
The researchers created ATLAS, a collection of about 800 original science problems spanning seven fields: math, physics, chemistry, biology, computer science, earth science, and materials science. The problems demand complex, multi-step reasoning, and the answers aren't simple multiple-choice selections; they are open-ended responses that often require detailed explanations and LaTeX-formatted mathematical expressions. The questions were written by PhD-level domain experts and put through multi-stage peer review and adversarial testing to ensure they are difficult, scientifically valuable, and not already present in models' training data. Finally, the team built an automated grading pipeline that uses a panel of LLM judges to assess these complex answers.
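The paper describes this judging setup only at a high level, but the panel-of-LLM-judges idea can be illustrated with a minimal sketch. Everything below is an assumption made for clarity: the prompt wording, the injected `call_llm` client, and the simple majority vote are illustrative choices, not the authors' actual protocol.

```python
# Minimal sketch of a panel-of-LLM-judges grader.
# The judge prompt, the call_llm client, and majority voting are
# illustrative assumptions, not the ATLAS authors' exact protocol.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    judge: str
    correct: bool
    rationale: str

JUDGE_PROMPT = """You are grading a scientific answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Treat mathematically equivalent LaTeX expressions as equal.
Reply with "CORRECT" or "INCORRECT" on the first line, then a brief rationale."""

def grade_with_panel(
    question: str,
    reference: str,
    candidate: str,
    judges: List[str],
    call_llm: Callable[[str, str], str],  # (model_name, prompt) -> response text
) -> bool:
    """Return True if a majority of judge models deem the candidate answer correct."""
    verdicts: List[Verdict] = []
    for judge in judges:
        prompt = JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate
        )
        response = call_llm(judge, prompt)
        first_line = response.strip().splitlines()[0].upper()
        verdicts.append(
            Verdict(judge=judge,
                    correct=first_line.startswith("CORRECT"),
                    rationale=response)
        )
    votes_for = sum(v.correct for v in verdicts)
    return votes_for > len(verdicts) / 2
```

In practice, `call_llm` would wrap a real chat-completion client and `judges` would list two or more judge model names; the point of the panel is that an answer is accepted only when the judges agree, which makes the automated grading of open-ended, LaTeX-heavy answers more robust than relying on a single judge.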
Why it matters?
ATLAS is important because it provides a more reliable way to measure the progress of LLMs towards true Artificial General Intelligence (AGI). It’s a much tougher test that better reflects the kind of complex, cross-disciplinary thinking scientists do. By creating a consistently challenging benchmark, ATLAS can help researchers focus on developing models that genuinely *understand* science, rather than just memorizing facts.
Abstract
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.