
Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou, Qipeng Guo, Siyuan Feng, Pengfei Liu

2026-02-17


Summary

This paper introduces a new way to think about improving the data used to train large AI models, arguing that better models can actually *create* better data for future models, leading to a cycle of improvement.

What's the problem?

Currently, there isn't a structured approach to consistently improve the quality of data used for training these powerful AI models. While we know good data is crucial for good performance, there's a gap in how we systematically process and refine that data, especially in complex fields like scientific research where the language can be very specific and difficult for AI to understand.

What's the solution?

The researchers propose a framework called 'Data Darwinism,' which defines ten levels of data refinement (L0 through L9), starting from raw data and progressively improving it with increasingly capable AI models. They applied it to scientific literature, building a large dataset of roughly 900 billion tokens called 'Darwin-Science.' Rather than simply rewriting the text, they used powerful language models to *explain* the reasoning and specialized terminology within it, making it easier for other models to learn from. To rule out the possibility that the gains came from models having already seen similar data, they trained new models from scratch with no scientific content at all, and only then continued training them on the improved dataset.
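To make the idea concrete, here is a minimal, hypothetical sketch of what a single refinement pass like this could look like. It is not the authors' actual pipeline: the prompt wording and the `call_llm` placeholder are assumptions standing in for whichever frontier model and API one happens to use.

```python
# Hypothetical sketch of a refinement pass in the spirit of the paper's
# higher processing levels (explicating reasoning and terminology).
# `call_llm` is a placeholder, not a real library call.

REFINE_PROMPT = """Rewrite the passage below as pre-training data.
1. Keep all factual content unchanged.
2. Spell out implicit reasoning steps so each conclusion follows explicitly.
3. Define every specialized term the first time it appears.

Passage:
{passage}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a capable language model via your own client."""
    raise NotImplementedError("Plug in your model client here.")

def refine_passage(passage: str) -> str:
    """One refinement step: ask the model to explicate reasoning and terminology."""
    return call_llm(REFINE_PROMPT.format(passage=passage))

def refine_corpus(raw_passages: list[str]) -> list[str]:
    """Apply the refinement step across a raw corpus to produce refined training text."""
    return [refine_passage(p) for p in raw_passages]
```

The key design choice this sketch illustrates is that the model is asked to add explanatory structure to existing text, not to invent new facts, which is what makes the refined corpus safer to train on.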

Why it matters?

This work is important because it shows a clear path to building even better AI models. By systematically improving the data they learn from, and by releasing both the improved dataset and the models they trained, the researchers are providing the tools for others to continue this cycle of improvement, potentially leading to breakthroughs in many scientific fields.

Abstract

Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.