ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca
2025-10-29
Summary
This paper introduces a way to test how well artificial intelligence (AI) agents can actually *do* scientific research: by having them recreate the work in existing research papers.
What's the problem?
As AI gets better, people hope it can help with scientific discovery, but we don't have a good way to check whether its work is accurate and reliable. It's not enough for an AI to just *say* it's doing research; we need to know whether it follows the correct methods and gets the right answers. Current AI models struggle with complex tasks like replicating an entire research project.
What's the solution?
The researchers created a benchmark called ReplicationBench. They took actual astrophysics papers and broke each one down into smaller tasks: setting up experiments, doing derivations and calculations, analyzing data, and writing code. Each task was co-developed with the paper's original authors and targets a key scientific result. AI agents were then asked to complete these tasks, and their work was graded on two axes: faithfulness (did the agent follow the original methods?) and correctness (did it get the right results?).
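To make the setup concrete, here is a minimal, hypothetical sketch in Python of how one such task and its two-axis grading might be represented. All names and values (`ReplicationTask`, `grade_correctness`, the example slope) are illustrative assumptions, not the actual ReplicationBench code or data.

```python
# Hypothetical sketch of a paper-replication task and its two-axis grading.
# Everything here is illustrative; none of it is drawn from the actual
# ReplicationBench codebase.
from dataclasses import dataclass


@dataclass
class ReplicationTask:
    """One sub-task carved out of a paper (a derivation, data analysis, etc.)."""
    paper_id: str
    description: str             # what the agent must reproduce
    reference_value: float       # the key numerical result reported in the paper
    rel_tolerance: float = 0.05  # relative tolerance for counting as "correct"


@dataclass
class Grade:
    faithfulness: bool  # expert judgment: did the agent follow the paper's method?
    correctness: bool   # check: did it land on the right number?


def grade_correctness(task: ReplicationTask, agent_value: float) -> bool:
    """Correct if the agent's value matches the paper's within tolerance."""
    return abs(agent_value - task.reference_value) <= (
        task.rel_tolerance * abs(task.reference_value)
    )


# Toy example: the paper reports a best-fit slope of 1.32; the agent gets 1.30.
task = ReplicationTask(
    paper_id="example-paper",
    description="Re-fit the scaling relation from the archival catalog.",
    reference_value=1.32,
)
grade = Grade(
    faithfulness=True,  # an expert judged the method matches the original
    correctness=grade_correctness(task, agent_value=1.30),
)
print(grade)  # Grade(faithfulness=True, correctness=True)
```

The real benchmark grades richer artifacts (derivations, code, analyses) with expert review rather than a single number, but the two-axis structure of faithfulness plus correctness is the same.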
Why does it matter?
This work is important because it provides a standardized and rigorous way to evaluate AI's ability to perform scientific research. It shows where current AI models struggle, which helps researchers improve them. Because astrophysics relies heavily on data analysis and computation, it's a good field in which to test AI agents, and the lessons learned can likely carry over to other data-driven sciences. Ultimately, this helps us understand how close we are to having AI that can truly assist with scientific discovery.
Abstract
Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.