Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection

Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong

2025-10-23

Summary

This paper focuses on how to choose the best data to train large language models (LLMs), like the ones powering chatbots. It argues that simply picking the 'highest quality' data isn't enough; you also need a lot of variety in that data.

What's the problem?

When training LLMs, people usually select data that scores well on measures like factual correctness and writing quality. However, the paper shows that simply picking the top-rated data actually *hurts* performance. This is because the different ways we measure 'good' data (like correctness and writing style) are often correlated, so picking the best overall scores means repeatedly selecting similar data and missing out on important diversity. It's like choosing friends only based on how funny they are: you might miss out on friends who are good at giving advice or helping with problems.
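The collapse of correlated metrics can be seen in a toy example (the data and metric names here are made up for illustration, not from the paper): when two quality metrics track each other closely, ranking by their sum selects nearly the same items as ranking by either metric alone, so the second metric adds no variety to the selection.

```python
def top_k(scores, k):
    """Return the indices of the k highest-scoring items."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])

# Hypothetical documents scored on two correlated axes:
# "style" tracks "correctness" closely for every document.
correctness = [0.91, 0.85, 0.40, 0.78, 0.55, 0.96, 0.30, 0.70]
style       = [0.90, 0.83, 0.42, 0.80, 0.50, 0.95, 0.33, 0.72]

combined = [c + s for c, s in zip(correctness, style)]

by_combined    = top_k(combined, 3)
by_correctness = top_k(correctness, 3)

# Because the axes are correlated, the "best overall" picks collapse
# onto the single-axis picks; no extra variety is gained.
print(by_combined == by_correctness)  # True for this toy data
```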

What's the solution?

The researchers developed a method called ODiS, which stands for Orthogonal Diversity-Aware Selection. It first evaluates each piece of data on several different qualities: language quality, factual (knowledge) quality, and how hard the text is to understand. Then it uses a mathematical technique called Principal Component Analysis (PCA) to decorrelate these qualities, turning them into independent (orthogonal) dimensions. Finally, it picks the top-scoring data *within each independent dimension*, ensuring a good mix of different types of information. To make scoring practical on huge corpora, a smaller RoBERTa-based model is trained to predict these scores efficiently.
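The pipeline above can be sketched in miniature. This is not the paper's implementation: it assumes only two raw metrics (so PCA reduces to a closed-form 2x2 eigendecomposition), skips the RoBERTa scorer entirely, and the function names are mine. It shows the core idea: decorrelate the score axes, then take the top-k along each orthogonal axis.

```python
import math

def pca_2d(rows):
    """Closed-form PCA for 2-D score vectors: center the data, then
    eigendecompose the 2x2 covariance matrix. Returns (means, [v1, v2])
    with orthonormal principal axes sorted by decreasing variance."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    a = sum((r[0] - mx) ** 2 for r in rows) / n           # var(x)
    c = sum((r[1] - my) ** 2 for r in rows) / n           # var(y)
    b = sum((r[0] - mx) * (r[1] - my) for r in rows) / n  # cov(x, y)
    # Larger eigenvalue of [[a, b], [b, c]].
    l1 = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if abs(b) < 1e-12:  # already uncorrelated: axes are the raw metrics
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        v = (l1 - c, b)  # eigenvector for l1
        norm = math.hypot(v[0], v[1])
        v1 = (v[0] / norm, v[1] / norm)
    v2 = (-v1[1], v1[0])  # orthogonal complement
    return (mx, my), [v1, v2]

def odis_like_select(scores, k_per_dim):
    """Project raw metric scores onto the orthogonal PCA axes, take the
    top-k items along each axis, and return the union of the picks."""
    (mx, my), axes = pca_2d(scores)
    selected = set()
    for ax in axes:
        proj = [(r[0] - mx) * ax[0] + (r[1] - my) * ax[1] for r in scores]
        ranked = sorted(range(len(scores)), key=lambda i: proj[i], reverse=True)
        selected |= set(ranked[:k_per_dim])
    return selected

# Hypothetical (language quality, knowledge quality) scores.
docs = [(0.90, 0.80), (0.10, 0.20), (0.80, 0.90), (0.50, 0.50), (0.95, 0.10)]
picked = odis_like_select(docs, 2)
print(sorted(picked))
```

Selecting within each orthogonal dimension is what distinguishes this from summing the raw scores: items that excel on a direction of variation the combined score would wash out still make it into the training set.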

Why it matters?

This research is important because it shows that simply focusing on 'quality' when building training data for LLMs is a mistake. You need to actively ensure diversity too. By using ODiS, the researchers were able to train models that performed significantly better on various tasks, demonstrating that a more balanced approach to data selection is crucial for creating powerful and reliable language models.

Abstract

High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single- or multi-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader range is required to recover results. This non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which the top-scored data can be directly selected. Therefore, we propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during data selection. First, ODiS evaluates data from multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress the data onto PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity. Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming orthogonality between dimensions. More importantly, models trained with ODiS-selected data significantly outperform other baselines on downstream benchmarks, highlighting the necessity of orthogonal, diversity-aware data selection for LLMs.
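The "less than 2% inter-dimension overlap" result can be checked with a simple pairwise set comparison. The paper does not spell out its exact overlap formula, so the sketch below assumes one reasonable definition (shared items divided by the smaller selection's size); the function name and data are hypothetical.

```python
from itertools import combinations

def pairwise_overlap(selections):
    """Return {(i, j): overlap fraction} for each pair of selected
    index sets, where the fraction is |A & B| / min(|A|, |B|)."""
    out = {}
    for (i, a), (j, b) in combinations(enumerate(selections), 2):
        out[(i, j)] = len(a & b) / min(len(a), len(b))
    return out

# Hypothetical selections of 100 documents along three orthogonal
# dimensions, with one shared document between adjacent pairs.
sels = [set(range(0, 100)), set(range(99, 199)), set(range(198, 298))]
print(pairwise_overlap(sels))  # each adjacent pair shares 1 of 100 items (0.01)
```

Low values across all pairs indicate that the orthogonal dimensions are indeed selecting largely disjoint data, which is the diversity guarantee the abstract describes.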