DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du
2025-10-21
Summary
This paper introduces DeepAnalyze-8B, a new artificial intelligence system designed to carry out data science work completely on its own, going from raw data sources to detailed, analyst-grade reports, much like a human data scientist would.
What's the problem?
Traditionally, creating AI that can handle all the steps of data science – from understanding the initial question to analyzing data and drawing conclusions – has been really difficult. Existing AI systems usually follow a set path of instructions, which limits their ability to adapt to complex or unexpected situations in real-world data. They aren't truly 'autonomous' because they need someone to pre-define exactly what they should do.
What's the solution?
The researchers developed DeepAnalyze-8B and trained it using a method inspired by how people learn data science. They started with simple tasks and gradually increased the complexity, allowing the AI to build up its skills step-by-step. They also created a special process to generate high-quality training data specifically for this AI. This training allowed DeepAnalyze-8B to handle a wide range of data tasks, including answering questions about data, performing specific analyses, and even conducting open-ended research.
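The easy-to-hard idea described above can be illustrated with a minimal sketch. This is not the authors' training code: the `Task` class, the `difficulty` scores, the `curriculum_stages` helper, and the ranking of the three task types are all hypothetical, chosen only to show how a set of tasks might be grouped into stages of increasing complexity.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    difficulty: int  # hypothetical score, 1 = simplest


def curriculum_stages(tasks, num_stages):
    """Group tasks into consecutive stages, easiest tasks first."""
    if not tasks:
        return []
    ordered = sorted(tasks, key=lambda t: t.difficulty)
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Task types named in the summary; the difficulty ranking is an assumption.
tasks = [
    Task("open-ended data research", 3),
    Task("data question answering", 1),
    Task("specialized analysis", 2),
]

stages = curriculum_stages(tasks, num_stages=3)
stage_names = [[t.name for t in stage] for stage in stages]

for index, names in enumerate(stage_names, start=1):
    print(f"stage {index}: {names}")
```

A training loop built on this would visit the stages in order, so the model sees simpler tasks before it is asked to combine skills on harder ones.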
Why does it matter?
DeepAnalyze-8B is significant because it shows that a relatively small model (only 8 billion parameters) can outperform earlier workflow-based agents built on the most advanced proprietary LLMs at complete data science projects. Importantly, the researchers are making the model, its code, and its training data publicly available, which will help other researchers build on this work and accelerate the development of truly autonomous data science tools.
Abstract
Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on the most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.