
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith

2024-07-26


Summary

This paper introduces data mixture inference, a method for uncovering what kinds of data were used to train language models by analyzing their byte-pair encoding (BPE) tokenizers. From a tokenizer's merge rules alone, the method estimates the proportions of different languages and data sources in its training data.

What's the problem?

Language models are trained on vast amounts of text, but developers rarely disclose what kinds of data were included or in what proportions. This lack of transparency makes it hard to anticipate how these models will behave or what biases they might carry.

What's the solution?

The researchers introduced an approach that examines the ordered merge rules learned by BPE tokenizers, which most modern language models use. Because BPE picks each merge greedily by frequency, the order of the merge list reflects how often different byte pairs appeared in the tokenizer's training data. Given sample data for each category of interest (such as different languages or data sources), they formulate a linear program that solves for the proportion of each category in the tokenizer's training set. They validated the method on tokenizers trained on known mixtures and then applied it to tokenizers released with popular language models, uncovering new details about their training data composition.
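To make the key observation concrete, here is a minimal toy sketch (not the authors' implementation) of why merge order carries information: BPE training counts adjacent symbol pairs and always merges the most frequent one first, so the earliest merges reflect which pairs dominate the training corpus. The corpora and category labels below are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across a corpus of words.

    In BPE training, the pair with the highest count becomes the next
    merge rule, so the ordered merge list mirrors pair frequencies in
    the tokenizer's training data.
    """
    counts = Counter()
    for word in corpus:
        symbols = list(word)
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += 1
    return counts.most_common(1)[0]

# Toy corpora standing in for two hypothetical categories.
english_like = ["the", "then", "there", "other"]
code_like = ["def", "ifdef", "undef"]

# A mixture that is mostly English-like text tends to pick an
# English-favored pair such as ('t', 'h') as its earliest merge.
print(most_frequent_pair(english_like + english_like + code_like))
```

Running this prints `(('t', 'h'), 8)`: the dominant category wins the first merge, which is exactly the signal the attack exploits across the full merge list.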

Why it matters?

This research matters because it provides a way to probe the training data behind language models, which can help identify potential biases and improve model transparency. Knowing what types of data went into a model supports building AI systems that are fairer and more reliable, which is valuable for developers and users alike.

Abstract

The pretraining data of today's strongest language models is opaque. In particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent to which tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
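As a simplified, hypothetical sketch of the kind of linear program the abstract describes (not the paper's actual formulation), the snippet below uses scipy.optimize.linprog to find mixture weights consistent with a single observed merge: under the inferred mixture, the chosen pair's weighted count must be at least that of every competing pair, with slack variables absorbing any violation. All counts, category names, and pair labels are made up.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical per-category pair counts (rows: categories, cols: candidate pairs).
counts = np.array([
    [9.0, 2.0, 1.0],   # category 0: pair "th" dominant
    [1.0, 8.0, 5.0],   # category 1: pair "de" dominant
])
n_cat, n_pairs = counts.shape

# Suppose the tokenizer's first merge was pair 0. Under mixture weights
# alpha, its weighted count must be >= that of every competing pair.
chosen = 0
competitors = [p for p in range(n_pairs) if p != chosen]
n_slack = len(competitors)

# Variables: [alpha_0, ..., alpha_{n-1}, s_0, ..., s_{m-1}];
# minimize total slack (constraint violation).
c = np.concatenate([np.zeros(n_cat), np.ones(n_slack)])

# alpha . (counts[:, p] - counts[:, chosen]) - s_j <= 0 for each competitor p
A_ub = np.zeros((n_slack, n_cat + n_slack))
for j, p in enumerate(competitors):
    A_ub[j, :n_cat] = counts[:, p] - counts[:, chosen]
    A_ub[j, n_cat + j] = -1.0
b_ub = np.zeros(n_slack)

# Mixture weights sum to 1.
A_eq = np.concatenate([np.ones(n_cat), np.zeros(n_slack)])[None, :]
b_eq = np.array([1.0])

bounds = [(0, 1)] * n_cat + [(0, None)] * n_slack
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
print("feasible mixture:", res.x[:n_cat])
```

With only one merge observed, many mixtures are feasible and the solver returns just one of them; the actual attack constrains the solution with the tokenizer's entire ordered merge list, which is what pins down the category proportions precisely.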