PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang
2026-03-31
Summary
This paper investigates whether AI agents built on large language models can actually *do* science: independently reproducing the results of published research papers rather than merely discussing them.
What's the problem?
While AI agents are getting good at tasks that *look* scientific, such as writing formulas or code, it is unclear whether they can take a real scientific paper, understand the methods it describes, write the code to run those methods themselves, and obtain the same results as the original researchers. In short: can they carry out a complete scientific experiment based solely on a paper's description?
What's the solution?
The researchers created a benchmark called PRBench, comprising 30 challenging tasks spanning 11 subfields of physics. These tasks aren't just theoretical; they require the AI to actually *implement* the experiments described in real, published papers and match the original results. Each agent was given only the paper and the task instructions, and its code was run in a sandboxed environment to prevent cheating. The researchers then tested several AI coding agents, including one powered by OpenAI's GPT-5.3-Codex, to see how well they could perform.
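To make the setup concrete, here is a minimal sketch of what one scoring step of such a benchmark could look like: comparing an agent's reproduced numbers against a paper's ground-truth values within a relative tolerance, then aggregating weighted rubric items into an overall score. All function names, weights, and the 5% tolerance are illustrative assumptions, not PRBench's actual pipeline.

```python
# Hypothetical PRBench-style scoring step (illustrative only).
import math

def matches_ground_truth(reproduced, expected, rel_tol=0.05):
    """True if every reproduced value is within rel_tol of the paper's value."""
    return all(
        math.isclose(r, e, rel_tol=rel_tol)
        for r, e in zip(reproduced, expected)
    )

def rubric_score(items):
    """Weighted mean over (weight, passed) rubric items, in [0, 1]."""
    total = sum(w for w, _ in items)
    earned = sum(w for w, passed in items if passed)
    return earned / total if total else 0.0

# Example: an agent reproduces two observables reported in a paper.
reproduced = [1.02, 3.98]
expected = [1.00, 4.00]
data_ok = matches_ground_truth(reproduced, expected)

score = rubric_score([
    (0.5, data_ok),  # data accuracy: numbers match the publication
    (0.3, True),     # code correctness: pipeline runs end to end
    (0.2, False),    # e.g. a reproduced figure matches the original
])
print(round(score, 2))
```

A rubric like this explains how an agent can score partial credit (the paper reports a 34% mean overall score for the best agent) while still failing every task end to end: passing some weighted items is much easier than matching every ground-truth result at once.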
Why does it matter?
This work matters because it provides a tough, realistic test of AI's ability to contribute to scientific research. The results show that current AI agents still struggle significantly with this kind of end-to-end reproduction, making mistakes in coding and data handling, and sometimes even fabricating results. PRBench helps pinpoint where AI must improve to become a truly useful tool for scientists.
Abstract
AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end reproduction success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.