Data Efficacy for Language Model Training

Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li

2025-07-02

Summary

This paper introduces DELT, a new approach to training language models that focuses on organizing the training data to get the best performance out of it. DELT improves how models learn without needing more data or bigger models, by scoring, selecting, and ordering training samples in a smart way.

What's the problem?

The problem is that training language models on lots of data doesn't always mean they learn better. The way data is chosen and arranged during training can cause the model to forget earlier material or learn less efficiently, limiting its final performance.

What's the solution?

The researchers designed DELT to optimize the training data itself: it scores each sample on how learnable and useful it is, selects the best samples, and orders them in a way that reduces forgetting and bias. This lets models learn more effectively from the same data and compute budget.
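The score-select-order pipeline described above can be sketched in a few lines of Python. Note this is only an illustrative interpretation: the paper's actual Learnability-Quality Scoring and Folding Ordering formulas are not given in this summary, so `score_sample`, the equal weighting, the keep ratio, and the fold-interleaving scheme below are all hypothetical placeholders.

```python
from itertools import zip_longest

def score_sample(sample):
    # Placeholder scoring: combine a "learnability" signal (e.g. how much
    # a reference model's loss drops on this sample) with a "quality"
    # signal (e.g. a data-quality classifier score). Weights are assumed.
    return 0.5 * sample["learnability"] + 0.5 * sample["quality"]

def delt_organize(samples, keep_ratio=0.8, num_folds=4):
    # 1) Data Scoring: score every training sample, best first.
    scored = sorted(samples, key=score_sample, reverse=True)

    # 2) Data Selection: keep only the top fraction by score.
    kept = scored[: max(1, int(len(scored) * keep_ratio))]

    # 3) Data Ordering (a folding-style sketch): split the score-sorted
    #    list into contiguous folds, then interleave the folds so every
    #    stretch of training mixes easier and harder samples, which is
    #    one plausible way to reduce forgetting.
    fold_size = -(-len(kept) // num_folds)  # ceiling division
    folds = [kept[i * fold_size:(i + 1) * fold_size] for i in range(num_folds)]
    ordered = []
    for group in zip_longest(*folds):
        ordered.extend(s for s in group if s is not None)
    return ordered
```

With contiguous folds, round-robin interleaving means consecutive training samples span the difficulty range rather than going strictly easy-to-hard, so no score band is concentrated at one end of training.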

Why it matters?

This matters because it shows that how training data is organized can be just as important as how much data there is or how big the model is. Better data efficacy helps create stronger language models faster and more cheaply, making AI technology more accessible.

Abstract

DELT, a paradigm for optimizing data organization in language model training, enhances performance through Data Scoring, Data Selection, and Data Ordering, with Learnability-Quality Scoring and Folding Ordering as key components.