DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng

2025-12-23

Summary

This paper introduces DataFlow, a new system for creating the datasets used to train and improve Large Language Models (LLMs). It's designed to make the process of getting data ready for LLMs more organized, repeatable, and effective.

What's the problem?

Currently, preparing data for LLMs is often done with messy, one-off scripts. This makes it hard to share data preparation methods, reproduce results, and automatically improve the data based on how the LLM is performing. It’s like trying to build with LEGOs when all the pieces are different shapes and there are no instructions.

What's the solution?

The researchers built DataFlow, which is like a set of standardized LEGO bricks and instructions for data preparation. It provides reusable components and a clear way to build 'data pipelines' – sequences of steps to clean, transform, and generate data. They also created DataFlow-Agent, which can automatically build these pipelines from simple English instructions. It’s essentially a system that lets you describe what kind of data you need, and it figures out how to create it.
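To make the 'standardized LEGO bricks' idea concrete, here is a minimal sketch of what a composable operator pipeline looks like in the abstract. The class names (`Operator`, `Deduplicate`, `FilterShort`, `Pipeline`) are illustrative assumptions, not the actual DataFlow API; the real framework exposes a PyTorch-style construction interface with nearly 200 operators.

```python
# Illustrative sketch of composable data-preparation operators.
# Class and method names are hypothetical, NOT the real DataFlow API.

class Operator:
    """One reusable data-transformation step."""
    def __call__(self, records):
        raise NotImplementedError

class Deduplicate(Operator):
    """Drop records whose text has already been seen."""
    def __call__(self, records):
        seen, out = set(), []
        for r in records:
            if r["text"] not in seen:
                seen.add(r["text"])
                out.append(r)
        return out

class FilterShort(Operator):
    """Drop records whose text is shorter than min_len characters."""
    def __init__(self, min_len=10):
        self.min_len = min_len
    def __call__(self, records):
        return [r for r in records if len(r["text"]) >= self.min_len]

class Pipeline:
    """Chain operators in sequence, nn.Sequential-style."""
    def __init__(self, *ops):
        self.ops = ops
    def run(self, records):
        for op in self.ops:
            records = op(records)
        return records

pipe = Pipeline(Deduplicate(), FilterShort(min_len=12))
data = [{"text": "short"},
        {"text": "a sufficiently long record"},
        {"text": "a sufficiently long record"}]
print(pipe.run(data))  # only one record survives dedup + length filter
```

Because every step shares the same interface, pipelines can be inspected, reordered, and debugged step by step, which is what makes them reusable and reproducible compared to one-off scripts.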

Why it matters?

This work is important because better data leads to better LLMs. DataFlow consistently improved the performance of LLMs on tasks like math, coding, and translating natural-language questions into SQL queries (Text-to-SQL). It even allowed models trained on a small, high-quality 10K-sample dataset created by DataFlow to outperform models trained on 1 million examples from a less refined dataset. This means we can potentially build powerful LLMs without needing massive amounts of data, making AI development more accessible and efficient.

Abstract

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.