Extracting alignment data in open models

Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, Jamie Hayes

2025-10-22


Summary

This paper explores how much of the data used to align a large language model (the post-training data that teaches it to be helpful and harmless) is still 'remembered' by the model, and how that data can actually be *extracted*.

What's the problem?

Researchers have been trying to figure out how much of their training data these models memorize, but most methods only look for exact or near-exact matches of text. That misses cases where the model reproduces the same content with slightly different wording or formatting. The problem is reliably identifying when a model is recalling something from its training, and understanding how much useful data can be recovered.

What's the solution?

The researchers used a different approach, comparing the *meaning* of texts using something called 'embeddings'. Embeddings represent pieces of text as points in a space where similar meanings are close together. This allowed them to find many more instances of the model recalling training data, even when the wording wasn't identical; by their conservative estimate, approximate string matching would have undercounted the extractable data by about 10×. They found that models readily repeat data used in post-training stages such as SFT or RL, and that this extracted data can be used to train a base model and recover a meaningful amount of the original model's performance.
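To make the contrast between string matching and embedding distance concrete, here is a minimal sketch. It is not the authors' pipeline: `toy_embed` is a hypothetical stand-in that uses word counts instead of a learned embedding model, and the two example strings are invented for illustration. A real embedding model would also capture paraphrases, which word counts cannot.

```python
import math
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: a bag-of-words vector standing in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented example: same content, but reordered and wrapped in markdown.
training_sample = "Question: What is 12 times 7? Answer: 12 times 7 is 84."
generation = "**Answer:** 12 times 7 is 84.\n\n**Question:** What is 12 times 7?"

string_score = edit_similarity(training_sample, generation)
vector_score = cosine_similarity(toy_embed(training_sample), toy_embed(generation))

print(f"edit similarity:   {string_score:.2f}")
print(f"vector similarity: {vector_score:.2f}")
```

Formatting and ordering changes pull the edit-based score down even though the content is the same, while the vector-based score stays high. This is the kind of trivial artifact the abstract says can deflate a string-matching metric.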

Why does it matter?

This work highlights a possibly overlooked risk: someone could extract the alignment data that was used to post-train a model. It also suggests that 'distillation' (a technique where one model learns from the outputs of another) might be a way of indirectly training on the original model's dataset, which has implications for how we understand and control these models.

Abstract

In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model -- useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10×) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.
