CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, Sham Kakade

2024-06-18

Summary

This paper presents CoLoR-Filter, a new method for selecting high-quality data to train language models more effectively. It focuses on choosing the best examples from a large dataset to improve how well these models perform on specific tasks.

What's the problem?

When training language models, having the right data is crucial for their success. However, finding the optimal subset of a large dataset is generally considered intractable, so practical methods must rely on heuristics. Many existing heuristics fail to efficiently identify the most useful examples, wasting compute and producing less effective models.

What's the solution?

The authors propose CoLoR-Filter, which uses an empirical Bayes-inspired approach to derive a simple, computationally efficient rule for selecting the most informative training examples. Each candidate example is scored by comparing its loss under two small 'auxiliary' models: one trained normally and one additionally conditioned on downstream data. Examples whose loss drops the most under the conditional model are kept. In experiments, data selected this way by a pair of 150-million-parameter auxiliary models trained a 1.2-billion-parameter target model to match one trained on 25 billion randomly selected tokens, using 25 times less data on Books and 11 times less on downstream question-answering tasks.
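The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes we already have per-example losses (in nats per token) from the two auxiliary models, and the function names and toy loss values are invented for demonstration.

```python
import numpy as np

def color_filter_scores(prior_losses, conditional_losses):
    """Score each example by its loss reduction: loss under the normally
    trained (prior) auxiliary model minus loss under the auxiliary model
    conditioned on downstream data. Higher score means the example looks
    more like the downstream data."""
    return np.asarray(prior_losses) - np.asarray(conditional_losses)

def select_top_fraction(scores, fraction):
    """Return indices of the top `fraction` of examples by score
    (aggressive subselection keeps only the highest-scoring data)."""
    k = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[::-1][:k]

# Toy per-example losses for 6 candidate sequences (illustrative values).
prior = [3.1, 2.8, 3.5, 2.9, 3.0, 3.3]
conditional = [2.4, 2.9, 2.6, 2.8, 3.1, 2.5]

scores = color_filter_scores(prior, conditional)
selected = select_top_fraction(scores, 0.5)
print(selected)  # indices of the 3 examples with the largest loss reduction
```

Because the criterion only needs two forward passes per candidate example through small auxiliary models, it stays cheap even when the target model is much larger.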

Why it matters?

This research is important because it helps improve the efficiency of training language models. By using CoLoR-Filter, researchers can create better-performing models without needing as much data, which saves time and resources. This advancement can lead to more effective AI applications in various fields, such as natural language processing, education, and customer service.

Abstract

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks. Code: https://github.com/davidbrandfonbrener/color-filter-olmo Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4