Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan

2026-04-29

Summary

This paper addresses the difficulty of getting large language models (LLMs) to reliably use specialized knowledge found in text, like what an expert in a specific field would know.

What's the problem?

Currently, when you try to teach an LLM something specific by showing it examples, it's a bit of a guessing game. If the model gets something wrong, you just add more examples hoping it will learn, but you don't really know *why* it failed or what specific information is missing from the training data. It's like trying to fix a computer program without knowing where the bug is.

What's the solution?

The researchers made the process work more like software development. They created a structured way to represent the knowledge from the text, almost like writing out the rules the model should follow, and that same structure underpins both the training data and the tests used to check the model. When the model makes a mistake, they can trace it back to the specific rule that is missing or broken and fix the training data accordingly. This 'Programming with Data' approach enables targeted repairs that produce consistent improvements across different model sizes and architectures, and the researchers have released the knowledge base, benchmarks, and training data for others to build upon. A rough sketch of what one repair cycle could look like is given below.
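To make the "compile, test, debug" analogy concrete, here is a minimal Python sketch of one such repair cycle. It is an illustration only: the names used here (Concept, TestCase, repair_cycle, author_examples, and so on) are assumptions made for this summary, not the authors' released tooling or API.

```python
# Hypothetical sketch of a test-driven data-repair loop in the spirit of
# "Programming with Data". None of these names come from the paper's
# released resources; they are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Concept:
    """One unit of domain knowledge extracted from the source corpus."""
    concept_id: str
    statement: str                                    # the rule the model should learn
    examples: List[str] = field(default_factory=list)


@dataclass
class TestCase:
    """A benchmark item linked back to the concepts it exercises."""
    question: str
    expected: str
    concept_ids: List[str]


def repair_cycle(
    knowledge_base: Dict[str, Concept],
    train_set: List[str],
    test_suite: List[TestCase],
    train_model: Callable[[List[str]], object],       # "compilation"
    ask_model: Callable[[object, str], str],          # used for "unit testing"
    author_examples: Callable[[Concept], List[str]],  # "debugging": write a patch
) -> Tuple[List[str], Set[str]]:
    """One compile -> test -> trace -> patch iteration over the training data."""
    model = train_model(train_set)                    # training data as source code

    failing: Set[str] = set()
    for case in test_suite:                           # benchmark as unit tests
        if ask_model(model, case.question).strip() != case.expected.strip():
            failing.update(case.concept_ids)          # trace the failure to concepts

    patches: List[str] = []
    for cid in failing:                               # targeted data repair
        patches.extend(author_examples(knowledge_base[cid]))

    return train_set + patches, failing
```

In this framing, train_model plays the role of the compiler, the loop over test_suite is the unit-test run, and author_examples is the "patch" written against the specific concept that failed, rather than more data added indiscriminately.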

Why it matters?

This work is important because it provides a more reliable and systematic way to build LLMs that can truly leverage human expertise. Instead of just throwing more data at the problem, we can now diagnose and fix specific knowledge gaps, leading to more accurate and trustworthy AI systems in areas like science, medicine, and engineering.

Abstract

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.