A Survey of Data Agents: Emerging Paradigm or Overstated Hype?

Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li, Wei Zhou, Xinyu Liu, Zhangyang Peng, Tianqi Luo, Yu Li, Chengliang Chai, Chong Chen, Shimin Di, Ju Fan, Ji Sun, Nan Tang, Fugee Tsung, Jiannan Wang, Chenglin Wu, Yanwei Xu, Shaolei Zhang, Yong Zhang

2025-10-28

Summary

This paper surveys 'data agents,' systems that combine data and artificial intelligence to handle complex data tasks automatically, and it sets out to define clearly what these agents are.

What's the problem?

Right now, the term 'data agent' is used inconsistently: some people mean a simple tool that answers questions, while others mean a fully automated system. This confusion creates unrealistic expectations, makes it hard to assign responsibility when things go wrong, and slows the development of the field as a whole.

What's the solution?

The researchers created a taxonomy for data agents, modeled on the SAE J3016 standard that ranks cars by their level of driving automation. It defines six levels, from completely manual operation (L0) to fully autonomous agents that can proactively find and use data (L5). They then reviewed existing data agent technologies and placed them within this framework, highlighting where the biggest challenges and opportunities lie, particularly in the L2-to-L3 transition: moving from agents that just follow instructions to agents that can independently manage data tasks.
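
As an illustration only, the six-level hierarchy can be sketched as an ordered enumeration. This is a minimal sketch, not the paper's formal definition: the class name `AutonomyLevel` and the helper `orchestrates_autonomously` are hypothetical, and only the levels that the abstract characterizes (L0, the L2-to-L3 boundary, and L5) carry descriptive comments.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Six ordered autonomy levels (hypothetical names, for illustration)."""

    L0 = 0  # manual operations, per the abstract
    L1 = 1  # not characterized in this summary
    L2 = 2  # procedural execution (lower side of the L2-to-L3 transition)
    L3 = 3  # autonomous orchestration (upper side of the transition)
    L4 = 4  # not characterized in this summary
    L5 = 5  # envisioned generative, fully autonomous data agents


def orchestrates_autonomously(level: AutonomyLevel) -> bool:
    """Hypothetical helper: True once an agent is at or past the L3 boundary."""
    return level >= AutonomyLevel.L3


if __name__ == "__main__":
    # List each level with the side of the L2-to-L3 boundary it falls on.
    for level in AutonomyLevel:
        side = "orchestration" if orchestrates_autonomously(level) else "below L3"
        print(f"{level.name}: {side}")
```

Using an ordered type means levels can be compared directly (for example, `AutonomyLevel.L2 < AutonomyLevel.L3`), which mirrors the idea of a progressive shift in autonomy.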

Why does it matter?

This work matters because it gives the field a common vocabulary for data agents. By defining what each level of autonomy means, it clarifies what a system can do and who is responsible for its results, helps researchers and companies build more effective and reliable systems, and lays out a roadmap for future development, potentially leading to more efficient and insightful data analysis.

Abstract

The rapid advancement of large language models (LLMs) has spurred the emergence of data agents--autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" currently suffers from terminological ambiguity and inconsistent adoption, conflating simple query responders with sophisticated autonomous architectures. This terminological ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth. Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that delineate and trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation. Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
