RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs
Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
2025-07-08
Summary
This paper introduces RefineX, a method that uses expert-guided programs to refine and clean large-scale pre-training data for large language models. These programs automatically edit and improve text before it is used for training.
What's the problem?
The problem is that pre-training data for language models often contains errors, inconsistencies, and low-quality text, which can hurt the performance of models trained on it. Fixing this data manually is too slow and expensive at the scale of modern corpora.
What's the solution?
The researchers designed RefineX to apply editing rules and expert knowledge automatically through programmatic editing tasks that improve the quality of massive pre-training datasets. Because the refinement is expressed as programs rather than manual rewriting, the data can be cleaned efficiently at scale without human effort.
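To make the idea of programmatic, edit-based refinement concrete, here is a minimal hypothetical sketch in Python. It is not the paper's actual implementation: the function names (delete_lines, refine_document) and the boilerplate patterns are illustrative assumptions, showing only the general shape of deletion-style edits applied to a raw document.

```python
import re

# Hypothetical deletion-style edit operations; names and patterns are
# illustrative, not taken from the RefineX paper.
BOILERPLATE_PATTERNS = [
    re.compile(r"^(share|subscribe|log in|cookie policy)\b", re.IGNORECASE),
    re.compile(r"^\s*copyright\s+\d{4}", re.IGNORECASE),
]

def delete_lines(text: str, patterns=BOILERPLATE_PATTERNS) -> str:
    """Remove lines matching any boilerplate pattern, keeping the rest verbatim."""
    kept = [line for line in text.splitlines()
            if not any(p.search(line) for p in patterns)]
    return "\n".join(kept)

def refine_document(raw_text: str) -> str:
    """Apply a pipeline of deletion-only edits to one pre-training document."""
    refined = delete_lines(raw_text)
    # Further deletion operations (e.g., dropping duplicated headers) could be chained here.
    return refined

if __name__ == "__main__":
    sample = "Share this article\nLLMs are trained on web text.\nCopyright 2024 Example Corp"
    print(refine_document(sample))
    # -> "LLMs are trained on web text."
```

A design point worth noting: because refinement is expressed as edits on the original text rather than as a full rewrite, the retained content stays exactly as it appeared in the source corpus, which is what makes this kind of cleaning safe to apply at pre-training scale.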
Why it matters?
This matters because higher-quality training data produces more capable and accurate language models, improving AI performance across applications such as writing, translation, and question answering.
Abstract
RefineX is a framework that uses programmatic editing tasks to efficiently refine large-scale pre-training data for LLMs, improving text quality and performance across various tasks.