
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Ang Chen

2025-02-26


Summary

This paper introduces Curie, a new AI agent system designed to help scientists run experiments more accurately and efficiently.

What's the problem?

While AI has become very capable at many tasks, it still struggles to carry out scientific experiments in a careful, trustworthy way. Scientists need experiments to be reliable, well-controlled, and easy to interpret, and current AI systems fall short on all three counts.

What's the solution?

The researchers created Curie, an AI agent framework that builds rigor into experimentation through three main parts: one ensures each individual agent's work is reliable, another keeps the overall process methodically controlled and organized, and the third records experimental knowledge so results stay interpretable. They tested Curie on a benchmark of 46 computer science questions and found it answered experimental questions correctly 3.4 times more often than the strongest competing AI system.
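To make the three-part design concrete, here is a minimal sketch of how such a rigor-focused pipeline could be organized. All class and function names here are illustrative assumptions for exposition; they are not Curie's actual API (the real implementation is at the GitHub link in the abstract).

```python
# Hypothetical sketch of a three-module rigor pipeline, loosely mirroring
# the components described in the paper. Names and interfaces are
# illustrative assumptions, not Curie's actual code.
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    question: str
    steps: list[str] = field(default_factory=list)


class IntraAgentRigor:
    """Checks a single agent's output before it propagates (reliability)."""

    def check(self, step_result: str) -> bool:
        # Toy validation: reject empty or whitespace-only results.
        return bool(step_result.strip())


class InterAgentRigor:
    """Keeps the multi-agent process methodically controlled and logged."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def record(self, step: str, result: str) -> None:
        self.log.append(f"{step} -> {result}")


class ExperimentKnowledge:
    """Stores findings so conclusions remain interpretable and traceable."""

    def __init__(self) -> None:
        self.findings: dict[str, str] = {}

    def add(self, step: str, result: str) -> None:
        self.findings[step] = result


def run_experiment(plan: ExperimentPlan, agent_fn) -> dict[str, str]:
    """Run each planned step through the three rigor modules."""
    intra, inter, knowledge = IntraAgentRigor(), InterAgentRigor(), ExperimentKnowledge()
    for step in plan.steps:
        result = agent_fn(step)
        if not intra.check(result):   # reliability gate on the agent's output
            result = agent_fn(step)   # simple retry on a failed check
        inter.record(step, result)    # methodical control: ordered audit log
        knowledge.add(step, result)   # interpretability: traceable findings
    return knowledge.findings
```

The key design idea this sketch illustrates is separation of concerns: validation, process control, and knowledge capture each live in their own module, so an experiment's trustworthiness does not depend on any single agent behaving well.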

Why it matters?

This matters because it could help scientists do research faster and more accurately. By automating parts of the experimental process, Curie could free researchers to focus on generating new ideas instead of spending time on repetitive tasks, potentially accelerating discoveries in fields like computer science, medicine, and beyond.

Abstract

Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.