DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu
2024-10-14

Summary
This paper introduces DA-Code, a benchmark designed to evaluate how well large language models (LLMs) can generate code for complex data science tasks that involve multiple steps.
What's the problem?
As data science becomes more important, there is a need for models that can write code to handle complex tasks involving data analysis, machine learning, and data wrangling. However, existing benchmarks have not effectively tested LLMs on these challenging tasks, which require advanced coding skills and the ability to work with real-world data.
What's the solution?
DA-Code addresses this problem by creating a set of 500 challenging tasks based on real data that require LLMs to perform various data science activities. The benchmark includes tasks in three main areas: data wrangling (cleaning and organizing data), exploratory data analysis (analyzing data to gain insights), and machine learning. Each task is designed to test the model's ability to write code in languages like Python and SQL while interacting with a controlled environment that mimics real-world scenarios.
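To make the task format more concrete, here is a minimal, hypothetical sketch of the kind of data wrangling step an agent might have to code in a DA-Code-style task. The table, column names, and cleaning rules are illustrative assumptions, not drawn from the benchmark itself.

```python
# Hypothetical data wrangling step in the style of a DA-Code task:
# clean a messy sales table so downstream analysis code can run on it.
# Columns and cleaning rules are illustrative assumptions only.
import pandas as pd

# Stand-in for a real CSV the benchmark environment would provide.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "price": ["$10.50", "$7.00", "$7.00", None, "$3.25"],
    "region": ["north", "SOUTH", "SOUTH", "East", "north"],
})

clean = (
    raw.drop_duplicates(subset="order_id")      # remove duplicate orders
       .dropna(subset=["price"])                # drop rows missing a price
       .assign(
           price=lambda d: d["price"].str.lstrip("$").astype(float),
           region=lambda d: d["region"].str.lower(),  # normalize categories
       )
)

print(clean.groupby("region")["price"].sum())   # simple downstream check
```

Real benchmark tasks chain many such steps (and may mix in SQL queries) before the model can derive the final answer, which is what makes them harder than single-function code generation.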
Why it matters?
This research is significant because it provides a new way to measure the capabilities of language models in generating code for complex data science tasks. By improving how we evaluate these models, DA-Code can help developers create better tools for data analysis and enhance the overall performance of AI in handling real-world problems.
Abstract
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, even with the current best LLMs it achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
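For intuition about how an agent baseline could operate in such an executable environment, below is a minimal, hypothetical observe-act loop in Python. The `generate_code` stub, the sandbox runner, and the "ANSWER:" stopping convention are assumptions made for illustration; they do not reproduce the paper's actual DA-Agent implementation.

```python
# Minimal, hypothetical agent loop in the spirit of an executable-environment
# baseline: the model proposes a script, the sandbox runs it, and the output
# (or error) is fed back as the next observation. Not the paper's DA-Agent.
import subprocess
import tempfile

def generate_code(history: list[str]) -> str:
    """Placeholder for an LLM call that turns the interaction history
    into the next Python script to try."""
    raise NotImplementedError("plug in your LLM client here")

def run_in_sandbox(code: str, timeout: int = 60) -> str:
    """Execute a candidate script in a separate process and capture output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr

def solve(task_description: str, max_steps: int = 5) -> str:
    history = [task_description]
    for _ in range(max_steps):
        code = generate_code(history)
        observation = run_in_sandbox(code)
        history.extend([code, observation])
        if "ANSWER:" in observation:  # assumed convention for a final answer
            return observation
    return history[-1]
```

Even a simple loop like this makes clear why grounding and planning matter: the agent must decide what to inspect, recover from execution errors, and know when the derived answer is final.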