
Large Language Model Unlearning via Embedding-Corrupted Prompts

Chris Yuhao Liu, Yaxuan Wang, Jeffrey Flanigan, Yang Liu

2024-06-13


Summary

This paper presents a new method called Embedding-COrrupted (ECO) Prompts, designed to help large language models (LLMs) 'unlearn' unwanted knowledge or biases. This matters for keeping these models safe and aligned with human values.

What's the problem?

Large language models are trained on vast amounts of text, which can include biases, private details, or incorrect facts that we later want them to forget. Teaching a model to forget specific information is difficult for two reasons: the boundary between what to forget and what to keep is fuzzy, so unlearning can cause collateral damage to knowledge the model should retain, and retraining state-of-the-art models with hundreds of billions of parameters demands enormous computing power. This makes it hard to control what the model should not know.

What's the solution?

The authors developed ECO, a lightweight framework that lets LLMs behave as if they had unlearned unwanted information without any retraining. Instead of relying on the model itself to forget, ECO uses a prompt classifier to flag prompts that touch the information to be forgotten. For flagged prompts, it adds learned 'corruptions' to the prompt's embeddings at inference time; these corruptions are optimized offline toward the unlearning objective, so the corrupted prompts produce responses that closely match those of a model never trained on that data. This approach achieves effective unlearning while minimizing side effects, as sketched below.
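The following is a minimal, hypothetical sketch of what such an inference-time pipeline could look like with a Hugging Face-style model. The classifier, the corruption tensor, and the function names are illustrative assumptions, not the paper's actual code.

```python
def eco_generate(model, tokenizer, classifier, corruption, prompt, max_new_tokens=64):
    """Generate a reply, corrupting the prompt embeddings if the classifier flags it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Map token ids to embeddings using the model's input embedding table.
    embeds = model.get_input_embeddings()(inputs["input_ids"])
    if classifier(prompt):
        # Add the offline-learned corruption (assumed to broadcast over token positions).
        embeds = embeds + corruption
    out = model.generate(
        inputs_embeds=embeds,
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Here, classifier(prompt) stands in for the trained prompt classifier (returning True when a prompt falls within the forget scope) and corruption for a tensor learned offline; the base model's weights are never touched.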

Why it matters?

This research is significant because it provides a practical way to improve the safety and reliability of large language models. By enabling models to selectively 'forget' harmful or biased knowledge, we can make them behave more ethically and align better with human values. The method is also scalable: because the model itself is never updated, its cost does not grow with model size, and the authors apply it to 100 LLMs ranging from 0.5B to 236B parameters.

Abstract

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a large language model should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging due to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and the large computational requirements for optimization across state-of-the-art models with hundreds of billions of parameters. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models to address both the challenges of knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. We learn corruptions added to prompt embeddings via zeroth order optimization toward the unlearning objective offline and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output from a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving promising unlearning at nearly zero side effects in general domains and domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases.
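As a rough illustration of the zeroth-order optimization mentioned in the abstract, the snippet below sketches a standard two-point finite-difference update for a corruption tensor. The objective function and all names here are assumptions for illustration, not the paper's implementation.

```python
import torch

def zeroth_order_step(corruption, unlearning_loss, lr=1e-2, mu=1e-3):
    """One gradient-free update of the corruption toward a lower unlearning loss."""
    u = torch.randn_like(corruption)                   # random perturbation direction
    loss_plus = unlearning_loss(corruption + mu * u)   # forward evaluations only;
    loss_minus = unlearning_loss(corruption - mu * u)  # no backprop through the LLM
    grad_est = (loss_plus - loss_minus) / (2 * mu) * u # directional-derivative estimate
    return corruption - lr * grad_est                  # plain gradient-descent step
```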