QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

Fengze Liu, Weidong Zhou, Binbin Liu, Zhimiao Yu, Yifan Zhang, Haobin Lin, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang, Yong Cao

2025-04-25

Summary

This paper introduces QuaDMix, a new system for picking out the best mix of training data for large language models, so that the AI learns from examples that are both high-quality and diverse.

What's the problem?

The problem is that if you only train a language model on super high-quality data, it might not learn to handle different types of writing or topics. But if you use only diverse data, you might end up with a lot of low-quality information that confuses the model or makes it less accurate.

What's the solution?

The researchers created QuaDMix, which is a method that carefully selects training data by balancing both quality and diversity. This way, the language model gets exposed to a wide range of topics and styles, but still learns from examples that are clear and reliable. This leads to better overall performance when the model is used for real-world tasks.
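To make the idea concrete, here is a minimal sketch of quality-diversity balanced sampling. It is an illustrative toy, not the paper's actual parameterized function: the quality scores, domain labels, and the `alpha`/`beta` exponents are all assumptions for demonstration. Each document's sampling weight rewards high quality while boosting documents from under-represented domains.

```python
from collections import Counter

def quadmix_style_weights(docs, alpha=1.0, beta=1.0):
    """Toy quality-diversity balanced sampling weights.

    docs: list of (quality_score, domain) pairs.
    Returns normalized sampling probabilities. This is a hypothetical
    sketch of the general idea, not QuaDMix's exact formulation.
    """
    domain_counts = Counter(domain for _, domain in docs)
    total = len(docs)
    weights = []
    for quality, domain in docs:
        # Quality term: higher-quality documents get larger weights.
        q_term = quality ** alpha
        # Diversity term: documents from rare domains are boosted,
        # so no single domain dominates the training mix.
        d_term = (total / domain_counts[domain]) ** beta
        weights.append(q_term * d_term)
    s = sum(weights)
    return [w / s for w in weights]  # normalize to probabilities

# Example: "news" is over-represented, so the single "code" document
# ends up with the highest sampling probability despite not having
# the highest quality score.
docs = [(0.9, "news"), (0.8, "news"), (0.7, "code"), (0.4, "forum")]
probs = quadmix_style_weights(docs)
```

Tuning `alpha` and `beta` shifts the balance: a larger `alpha` favors quality, a larger `beta` favors diversity, which mirrors the trade-off the paper is designed to optimize.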

Why it matters?

This matters because it helps AI models become smarter and more adaptable, making them better at understanding and generating all kinds of text, from creative stories to technical explanations. That is useful for anyone who relies on AI for information or communication.

Abstract

A unified data selection framework called QuaDMix optimizes the distribution of training data for large language models by balancing quality and diversity, leading to improved performance.