Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs
Yu Li, Xiaoran Shang, Qizhi Pei, Yun Zhu, Xin Gao, Honglin Lin, Zhanping Zhong, Zhuoshi Pan, Zheng Liu, Xiaoyang Wang, Conghui He, Dahua Lin, Feng Zhao, Lijun Wu
2026-04-14
Summary
This paper investigates how the datasets used to fine-tune Large Language Models (LLMs) are created and connected, arguing that we often treat these datasets as isolated artifacts when they actually share a history of being built upon and derived from one another.
What's the problem?
Currently, when building datasets to improve LLMs, there is little understanding of how those datasets relate to each other. Their origins and evolution aren't tracked, which leads to problems like hidden redundancy (the same information repeated across supposedly distinct datasets) and benchmark contamination (test questions and answers accidentally included in the training data), which can artificially inflate performance on evaluations. Essentially, we don't know where the data *comes* from or how it has been changed over time.
What's the solution?
The researchers developed a system using multiple 'agents' that automatically maps out the relationships between different datasets, creating a kind of family tree showing how they've been built and modified. They then used this map to create a *new* dataset, carefully selecting data from the original sources to avoid repetition and contamination. This new dataset prioritizes diversity by going back to the original data sources instead of relying on already-modified versions.
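As a small illustration (not the paper's actual implementation), the lineage "family tree" can be modeled as a directed graph whose edges point from an upstream source to a dataset derived from it; the "root" sources that the new dataset anchors on are then simply the nodes with no incoming edges. The dataset names and edges below are hypothetical placeholders:

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream_source, derived_dataset).
# Names are illustrative only, not taken from the paper's graph.
EDGES = [
    ("SourceCorpusA", "DerivedSet1"),
    ("SourceCorpusA", "DerivedSet2"),
    ("DerivedSet1", "DerivedSet1-Refined"),
    ("SourceCorpusB", "DerivedSet3"),
]

def build_lineage(edges):
    """Return child->parents and parent->children adjacency maps."""
    parents, children = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)
    return parents, children

def root_sources(edges):
    """Roots are datasets that were never derived from another dataset;
    sampling anchored here avoids already-modified downstream copies."""
    parents, _ = build_lineage(edges)
    nodes = {n for edge in edges for n in edge}
    return sorted(n for n in nodes if not parents[n])

print(root_sources(EDGES))  # → ['SourceCorpusA', 'SourceCorpusB']
```

Anchoring instruction sampling at these roots, rather than at their rewritten descendants, is what the summary means by "going back to the original data sources."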
Why it matters?
This work is important because it provides a way to systematically manage and improve the data used to train LLMs. By understanding the 'lineage' of datasets, we can build better, more reliable, and less biased models. It moves data curation from a somewhat random process to a more organized and controlled one, ultimately leading to more trustworthy AI systems.
Abstract
Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.
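As a minimal sketch (not the authors' code), the "propagation of benchmark contamination along lineage paths" described in the abstract reduces to reachability in the lineage graph: once a source dataset is flagged as contaminated, every dataset derivable from it along lineage edges inherits the contamination risk. The edge list and names are hypothetical:

```python
from collections import deque

# Hypothetical lineage edges (upstream -> derived); names are illustrative.
EDGES = [
    ("SourceCorpusA", "DerivedSet1"),
    ("DerivedSet1", "DerivedSet1-Refined"),
    ("DerivedSet1", "MixedCorpus"),
    ("SourceCorpusB", "MixedCorpus"),
]

def downstream(edges, contaminated):
    """BFS over derivation edges: every dataset reachable from a
    contaminated source is flagged as potentially contaminated too."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)
    seen, queue = set(), deque(contaminated)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(downstream(EDGES, ["SourceCorpusA"]))
# → ['DerivedSet1', 'DerivedSet1-Refined', 'MixedCorpus']
```

Because this operates on the graph's topology rather than on individual samples, it is cheap to run over an entire ecosystem of datasets, which is the efficiency argument the abstract makes for lineage-centric analysis over sample-level comparison.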