
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui

2026-04-03

Summary

This paper introduces DataFlex, a new system designed to make it easier to improve large language models (LLMs) by focusing on the data they're trained on, not just the model itself.

What's the problem?

Currently, improving LLMs often involves tweaking the training data: choosing which examples to use, how much of each data source to include, and how much weight each example gets. However, the tools for doing this are scattered across isolated codebases with incompatible interfaces, making it hard to compare data-centric techniques fairly, reproduce results, or actually use these methods in real-world training.

What's the solution?

DataFlex solves this by providing a single, unified framework built on an existing LLM training system called LLaMA-Factory. It supports three main ways to improve data: selecting the best examples (sample selection), adjusting the mix of different data sources (domain mixture adjustment), and changing how much each example counts toward the loss (sample reweighting). Importantly, it is designed as a drop-in replacement for the standard training process, requiring no major changes, and it works efficiently even with very large models and datasets.
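To make the "drop-in replacement" idea concrete, here is a minimal sketch of what a dynamic data-selection training loop could look like. All names here (`DynamicTrainer`, `LossBasedSelector`, the hook signatures) are hypothetical illustrations of the general pattern described in the paper, not DataFlex's actual API.

```python
# Hypothetical sketch of a data-centric dynamic trainer: each step, the pool
# is re-scored with a model-dependent signal and a selector picks the batch.
# Names are illustrative, not DataFlex's real interface.

class LossBasedSelector:
    """Keep the k samples with the highest current loss (hardest examples)."""

    def __init__(self, k):
        self.k = k

    def select(self, samples, losses):
        ranked = sorted(zip(samples, losses), key=lambda p: p[1], reverse=True)
        return [s for s, _ in ranked[: self.k]]


class DynamicTrainer:
    """Drop-in wrapper around a standard train step: the only change from
    static training is that the batch is re-chosen dynamically each step."""

    def __init__(self, loss_fn, train_step, selector):
        self.loss_fn = loss_fn      # model-dependent scoring (e.g. per-sample loss)
        self.train_step = train_step  # the unchanged underlying optimizer step
        self.selector = selector    # pluggable data-selection strategy

    def run(self, pool, steps):
        for _ in range(steps):
            losses = [self.loss_fn(s) for s in pool]
            batch = self.selector.select(pool, losses)
            for s in batch:
                self.train_step(s)
```

Swapping in a different `selector` (or a reweighting hook that scales the loss instead of filtering samples) changes the data strategy without touching the training step itself, which is the kind of modularity the framework aims for.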

Why it matters?

DataFlex is important because it provides a reliable and efficient way to experiment with and deploy data-centric training methods. In the paper's experiments, dynamically selecting and weighting data during training consistently outperforms static full-data training on benchmarks like MMLU, and DataFlex runs faster than the original implementations of these methods, making it a practical tool for building better LLMs.

Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
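As a rough illustration of the data-mixture methods named in the abstract: a DoReMi-style update multiplicatively boosts the sampling proportion of domains whose excess loss (proxy loss minus reference loss) is high, then renormalizes. This is a heavily simplified sketch of that one update rule, not the full DoReMi algorithm (which trains a proxy model against a reference model with a regularized exponentiated-gradient update); the function name and learning rate are illustrative.

```python
import math

def update_domain_weights(weights, excess_losses, lr=0.1):
    """Simplified DoReMi-style step: boost domains with high excess loss
    multiplicatively, then renormalize so proportions stay a distribution."""
    boosted = [w * math.exp(lr * e) for w, e in zip(weights, excess_losses)]
    total = sum(boosted)
    return [b / total for b in boosted]

# Example: two domains start at equal proportions; domain 0 shows higher
# excess loss, so its share of the training mixture grows.
new_w = update_domain_weights([0.5, 0.5], [1.0, 0.0])
```

Repeating this update during training shifts the corpus mixture toward domains the model is currently learning least well, which is the intuition behind the MMLU and perplexity gains the abstract reports for DoReMi and ODM over default proportions.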