Large-Scale Data Selection for Instruction Tuning

Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi

2025-03-04

Summary

This paper studies how to choose high-quality training data for instruction-tuning AI language models, focusing on what happens when the selection has to be made from very large pools of data.

What's the problem?

Existing methods for selecting good training data are usually tested on small datasets, but real-world instruction-tuned models are trained on hundreds of thousands to millions of samples drawn from even larger pools. Many existing methods don't hold up when scaled to these larger pools, and some even perform worse than randomly selecting data while using more compute.

What's the solution?

The researchers tested different data selection methods at much larger scales, selecting up to 2.5 million samples from pools of up to 5.8 million samples. They found that a method called RDS+, which represents each example using a weighted mean pooling of a pretrained language model's hidden states and then selects examples based on those representations, works best across all their tests. This method is both more effective and more compute-efficient than more complicated approaches.
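The core idea of representation-based selection can be sketched in a few lines: embed each example by pooling a language model's hidden states, then score pool examples by similarity to a small set of target examples and keep the top-scoring ones. The sketch below is illustrative only, assuming a simple position-weighted pooling and cosine-similarity scoring; the function names and weighting scheme are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def weighted_mean_pool(hidden_states, weights):
    """Pool per-token hidden states into one embedding.

    hidden_states: (seq_len, hidden_dim) array of LM hidden states.
    weights: (seq_len,) non-negative per-token weights (assumed scheme).
    """
    w = weights / weights.sum()
    return (hidden_states * w[:, None]).sum(axis=0)

def select_top_k(pool_embs, query_embs, k):
    """Pick the k pool examples most similar to any query example.

    pool_embs: (n_pool, hidden_dim) embeddings of candidate data.
    query_embs: (n_query, hidden_dim) embeddings of target examples.
    Returns indices of the k highest-scoring pool examples.
    """
    pool_n = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    query_n = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    # Cosine similarity of every pool example to every query example;
    # score each pool example by its best match.
    scores = (pool_n @ query_n.T).max(axis=1)
    return np.argsort(-scores)[:k]
```

With uniform weights, `weighted_mean_pool` reduces to a plain mean over tokens; the appeal of this family of methods is that embedding and similarity search are cheap relative to approaches that require extra model training or gradient computation.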

Why it matters?

This research matters because instruction-following ability is crucial for many real-world applications of language models. By identifying a selection method that stays effective and efficient at the scales used to train deployed models, this work helps practitioners build more capable models from large, noisy data pools, and it shows that new selection methods should be evaluated at realistic scales rather than only on small test pools.

Abstract

Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested -- all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.