Measuring memorization in RLHF for code completion
Aneesh Pappu, Billy Porter, Ilia Shumailov, Jamie Hayes
2024-06-20

Summary
This paper examines how large language models (LLMs) memorize training data during Reinforcement Learning with Human Feedback (RLHF), specifically in the context of code completion.
What's the problem?
As LLMs are trained using real user data to better understand and respond to preferences, there is a risk that these models could memorize sensitive information. If a model remembers this data and later reveals it in its responses, it could lead to privacy issues. However, it's unclear how memorization happens during the RLHF process compared to other training methods.
What's the solution?
The researchers analyzed how memorization surfaces at each stage of RLHF, focusing on code completion tasks. They found that RLHF makes it much less likely that data used for reward modeling and reinforcement learning is memorized, compared with fine-tuning directly on that data. However, examples that were memorized during the earlier fine-tuning stage usually remained memorized even after RLHF training.
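The paper defines its own memorization criteria; as a rough illustration of the kind of probe involved, the sketch below checks whether a code completion model reproduces the held-out suffix of a training example when prompted with its prefix. The function names, the 50/50 prefix split, and the 0.9 similarity threshold are illustrative assumptions, not the authors' exact protocol.

```python
# Illustrative sketch (not the paper's exact method): probe a code completion
# model for memorization by prompting it with the prefix of a training example
# and checking whether it reproduces the true suffix near-verbatim.
import difflib
from typing import Callable, Iterable


def is_memorized(generate_fn: Callable[[str], str], example: str,
                 split_ratio: float = 0.5, threshold: float = 0.9) -> bool:
    """Heuristic check: does the model emit the held-out suffix of a
    training example (near-)verbatim when given its prefix?"""
    split = int(len(example) * split_ratio)
    prefix, true_suffix = example[:split], example[split:]
    completion = generate_fn(prefix)[:len(true_suffix)]
    # Normalized string similarity; a score near 1.0 suggests the suffix was
    # reproduced from memory rather than re-derived from context.
    similarity = difflib.SequenceMatcher(None, completion, true_suffix).ratio()
    return similarity >= threshold


def memorization_rate(generate_fn: Callable[[str], str],
                      training_examples: Iterable[str]) -> float:
    """Fraction of training examples flagged as memorized."""
    examples = list(training_examples)
    if not examples:
        return 0.0
    flagged = sum(is_memorized(generate_fn, ex) for ex in examples)
    return flagged / len(examples)
```

Running such a check on the same dataset after each training stage (fine-tuning, reward modeling, reinforcement learning) gives a rough picture of how memorization changes as data moves through the pipeline.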
Why it matters?
Understanding how memorization works in RLHF is crucial because it helps ensure that user data is handled safely and responsibly. By identifying when and how models might memorize sensitive information, developers can create better safeguards to protect user privacy while still improving AI performance.
Abstract
Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important as real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized, in comparison to aligning via directly fine-tuning on this data, but that examples already memorized during the fine-tuning stage of RLHF will, in the majority of cases, remain memorized after RLHF.
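As a rough illustration of the comparison the abstract describes, the sketch below reuses the hypothetical memorization_rate helper from above to contrast memorization of fine-tuning data and RL-stage data across checkpoints. The callables and dataset names are placeholders, not an API or protocol from the paper.

```python
def compare_phases(generate_sft, generate_rlhf, sft_data, rl_data):
    """Hypothetical comparison of memorization across RLHF checkpoints, using
    the memorization_rate helper sketched above. Each generate_* callable
    wraps inference with a checkpoint saved after that training phase."""
    return {
        # Fine-tuning data: the paper reports that examples memorized at this
        # stage mostly remain memorized after RLHF.
        "sft_data @ sft_model": memorization_rate(generate_sft, sft_data),
        "sft_data @ rlhf_model": memorization_rate(generate_rlhf, sft_data),
        # Data used only for reward modeling / RL: the paper reports far less
        # memorization than if the policy were fine-tuned on it directly.
        "rl_data @ rlhf_model": memorization_rate(generate_rlhf, rl_data),
    }
```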