DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu

2025-12-05

Summary

This paper introduces DAComp, a new benchmark designed to test how well AI systems can handle real-world data tasks, from preparing data to actually analyzing it and making decisions.

What's the problem?

AI is currently good at writing code but struggles with the end-to-end process of working with data in a business setting. This includes not just writing SQL queries to prepare data, but also deciding what questions to ask, exploring the data, and turning the findings into useful advice. Existing benchmarks don't capture the complexity of these real-world data workflows, and current AI systems perform poorly when faced with them, especially at the data preparation stage.

What's the solution?

The researchers created DAComp, which includes 210 different tasks that mimic how data is used in companies. Some tasks involve building data pipelines from scratch using SQL, while others require analyzing data to answer open-ended business questions. They developed a way to automatically score the data preparation tasks and used a sophisticated AI 'judge' to evaluate the quality of the analysis and recommendations, based on detailed guidelines. They then tested several advanced AI systems on these tasks.
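To make the evaluation setup concrete, here is a minimal sketch of how an LLM judge's per-criterion scores might be rolled up through a hierarchical rubric into one task score. The class name, criterion names, and weights are all illustrative assumptions, not details from the paper; DAComp's actual rubrics and aggregation may differ.

```python
# Hypothetical sketch: aggregating hierarchical rubric scores for one
# open-ended data-analysis task. Leaf criteria hold a judge-assigned
# score in [0, 1]; parent nodes take the weighted mean of children.
# All names and weights below are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rubric:
    name: str
    weight: float = 1.0
    score: Optional[float] = None          # set only on leaf criteria
    children: List["Rubric"] = field(default_factory=list)

    def aggregate(self) -> float:
        if not self.children:              # leaf: use the judge's score
            return self.score if self.score is not None else 0.0
        total_w = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total_w


# Example two-level rubric for a single DA task.
rubric = Rubric("overall", children=[
    Rubric("planning", weight=0.3, score=0.8),
    Rubric("analysis", weight=0.4, children=[
        Rubric("correct_exploration", weight=0.5, score=0.6),
        Rubric("sound_interpretation", weight=0.5, score=1.0),
    ]),
    Rubric("recommendations", weight=0.3, score=0.5),
])

print(round(rubric.aggregate(), 3))  # 0.3*0.8 + 0.4*0.8 + 0.3*0.5 = 0.71
```

A hierarchy like this is useful because it lets the judge score narrow, well-defined criteria (which LLM judges handle more reliably) while the weights encode how much each stage of the workflow matters to the final grade.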

Why it matters?

DAComp is important because it clearly shows where AI systems are falling short in the field of data science. It's not just about generating code; it's about understanding the bigger picture and being able to reason about data. By providing a realistic and challenging benchmark, DAComp will help researchers develop better AI tools that can truly assist people in making data-driven decisions in businesses.

Abstract

Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and modifying existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM judge, guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io