AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger
2025-10-27
Summary
This paper argues that we need better ways to test AI agents designed to help with scientific research, and introduces a new testing suite called AstaBench to do just that.
What's the problem?
Currently, evaluating AI agents for science is difficult because existing tests don't accurately reflect real scientific work, aren't easily repeatable by different researchers, don't consider costs associated with using the AI, lack standard ways to build and test agents quickly, and don't have good baseline agents to compare against. Basically, it's hard to tell if these AI tools are *actually* helpful or just getting lucky.
What's the solution?
The researchers created AstaBench, a large collection of over 2,400 science-related problems covering the whole research process, from finding information to proposing new ideas. They also built a research environment with production-grade search tools that allows for fair, repeatable testing, and provided a set of baseline AI agents to compare against. They then tested 57 different AI agents using this new system.
Why does it matter?
This work matters because it provides a more reliable way to measure the progress of AI in science. The results show that while AI is improving in some areas, it is still a long way from being a truly effective assistant for scientists, and they highlight where future development needs to focus.
Abstract
AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.