
SciCode: A Research Coding Benchmark Curated by Scientists

Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret

2024-07-22


Summary

This paper presents SciCode, a new coding benchmark created to evaluate how well language models (LMs) can generate code for real scientific problems. It was developed with input from scientists across various fields to ensure it reflects actual research challenges.

What's the problem?

As language models have become better at coding tasks, it has become harder to create tests that accurately measure their abilities. Many existing benchmarks are either not challenging enough or do not reflect real scientific problems, which makes it difficult to assess how well these models can assist in research that demands complex reasoning and problem solving.

What's the solution?

To address this, the authors created SciCode, a benchmark of 338 subproblems decomposed from 80 main scientific problems across 16 natural-science subfields, such as mathematics, physics, and biology. Each subproblem tests knowledge recall, reasoning, and code synthesis. The benchmark also provides optional scientific background descriptions along with scientist-annotated gold-standard solutions and test cases for evaluation. Even the best-performing model tested, Claude3.5-Sonnet, solved only 4.6% of the problems in the most realistic setting, showing how challenging the benchmark is.
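To make this structure concrete, here is a minimal, hypothetical sketch of how a SciCode-style subproblem could be represented and scored against its test cases. The `Subproblem` class, the `run_test_cases` helper, the field names, and the toy physics example are illustrative assumptions, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass, field

@dataclass
class Subproblem:
    """One subproblem of a SciCode-style main problem (illustrative only)."""
    prompt: str                       # task description shown to the model
    background: str = ""              # optional scientific background text
    gold_solution: str = ""           # scientist-annotated reference code
    test_cases: list = field(default_factory=list)  # (inputs, expected) pairs


def run_test_cases(candidate_fn, subproblem, tol=1e-8):
    """Return True only if the candidate matches every expected output."""
    for inputs, expected in subproblem.test_cases:
        if abs(candidate_fn(*inputs) - expected) > tol:
            return False
    return True


# Toy usage: a made-up subproblem asking for a harmonic-oscillator energy level.
sub = Subproblem(
    prompt="Compute the energy E_n = (n + 1/2) * hbar * omega of level n.",
    test_cases=[((0, 1.0, 1.0), 0.5), ((2, 1.0, 2.0), 5.0)],
)

def candidate(n, hbar, omega):
    # A model-generated solution would be substituted here.
    return (n + 0.5) * hbar * omega

print(run_test_cases(candidate, sub))  # True if all test cases pass
```

In the benchmark itself, the most realistic setting withholds the optional background descriptions, which is why even strong models solve so few problems.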

Why it matters?

This research is important because it helps improve the evaluation of language models in scientific contexts, ensuring they can effectively assist researchers. By creating a benchmark that reflects real-world challenges, SciCode paves the way for better AI tools in science, ultimately supporting advancements in research and technology.

Abstract

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.