Practical Unlearning for Large Language Models
Chongyang Gao, Lixu Wang, Chenkai Weng, Xiao Wang, Qi Zhu
2024-07-18

Summary
This paper introduces a new framework called O3 that helps large language models (LLMs) 'unlearn' unwanted information while continuing to perform well on other tasks.
What's the problem?
Large language models can memorize sensitive or inappropriate information, which raises privacy and security concerns. Traditional unlearning methods often require access to the original training data to preserve the model's utility, but that data is not always available. These methods also struggle to remove unwanted knowledge without hurting the model's overall performance, and they rarely account for unlearning requests that keep arriving over time.
What's the solution?
The authors propose the O3 framework, which includes an Out-Of-Distribution (OOD) detector that measures how similar an incoming input is to the data that needs to be unlearned, and an orthogonal low-rank adapter (LoRA) that handles continuously arriving unlearning requests without retaining any original training data. At inference time, the model uses the detector's score to decide whether, and to what extent, to apply the unlearning adapter, so it can forget the requested knowledge while still functioning normally on unrelated inputs; a rough sketch of this soft-loading idea follows.
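The snippet below is a minimal, illustrative sketch of the soft-loading idea described above: a LoRA delta added to a base linear layer is scaled by a weight derived from an OOD-style similarity score. The class and function names, the sigmoid mapping, and the threshold/temperature parameters are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class SoftLoadedLoRALinear(nn.Module):
    """Linear layer whose LoRA delta is scaled by an externally supplied weight.

    The weight is meant to come from a detector that estimates how similar the
    current input is to previously unlearned data (illustrative sketch only).
    """

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor, unlearn_weight: float = 0.0) -> torch.Tensor:
        # unlearn_weight in [0, 1]: 0 keeps the original model behaviour,
        # 1 applies the full unlearning adapter.
        return self.base(x) + unlearn_weight * self.lora_B(self.lora_A(x))


def unlearn_weight_from_ood_score(score: float, threshold: float = 0.5,
                                  temperature: float = 0.1) -> float:
    """Map an OOD similarity score to a LoRA loading weight (hypothetical mapping)."""
    return torch.sigmoid(torch.tensor((score - threshold) / temperature)).item()


# Usage: the more the detector thinks the input resembles unlearned data,
# the more strongly the unlearning adapter is applied.
layer = SoftLoadedLoRALinear(nn.Linear(16, 16))
x = torch.randn(2, 16)
w = unlearn_weight_from_ood_score(score=0.9)
y = layer(x, unlearn_weight=w)
```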
Why it matters?
This research is important because it addresses the growing need for AI systems to be ethical and secure by allowing them to forget harmful or sensitive information. By improving how LLMs can unlearn unwanted data, the O3 framework can help make these models safer and more reliable for users, ultimately enhancing trust in AI technologies.
Abstract
While LLMs have demonstrated impressive performance across various domains and tasks, their security issues have become increasingly severe. Machine unlearning (MU) has emerged as a promising solution to address these issues by removing the influence of undesired data on the target model without compromising its utility in other aspects. MU typically assumes full access to the original training data to preserve utility, which is difficult to achieve in LLM unlearning. Existing LLM unlearning methods often assume access to data most affected by undesired data unlearning. However, this assumption underestimates the entanglement among various LLM capabilities and ignores data access limitations due to various issues. Moreover, these LLM unlearning methods do not sufficiently consider that unlearning requests in real-world scenarios are continuously emerging. To overcome these challenges and achieve practical LLM unlearning, we propose the O3 framework. The O3 framework includes an Out-Of-Distribution (OOD) detector to measure the similarity between input and unlearning data, and an Orthogonal low-rank adapter (LoRA) for continuously unlearning requested data. The OOD detector is trained with a novel contrastive entropy loss and utilizes a local-global layer-aggregated scoring mechanism. The orthogonal LoRA achieves parameter disentanglement among continual unlearning requests. During inference, our O3 framework can smartly decide whether and to what extent to load the unlearning LoRA based on the OOD detector's predictions. Notably, O3's effectiveness does not rely on any retained data. We conducted extensive experiments on O3 and state-of-the-art LLM unlearning methods across three tasks and seven datasets. The results indicate that O3 consistently achieves the best trade-off between unlearning effectiveness and utility preservation, especially when facing continuous unlearning requests.
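To make the "parameter disentanglement among continual unlearning requests" concrete, here is a minimal sketch of an orthogonality penalty between the LoRA down-projection matrices of different unlearning requests. The function name, the choice of penalizing the cross-Gram matrix, and the training loop fragment are assumptions for illustration; the paper's actual orthogonal-LoRA formulation may differ.

```python
import torch


def orthogonality_penalty(new_lora_A: torch.Tensor,
                          previous_lora_As: list[torch.Tensor]) -> torch.Tensor:
    """Penalise overlap between the current request's LoRA subspace and earlier ones.

    Each matrix is a (rank x in_features) LoRA down-projection. The penalty is zero
    when the row spaces of the new adapter and every frozen earlier adapter are
    orthogonal, which keeps the parameter updates of different requests disentangled
    (illustrative formulation, not necessarily the paper's exact loss).
    """
    penalty = new_lora_A.new_zeros(())
    for prev_A in previous_lora_As:
        # Frobenius norm of the cross-Gram matrix between the two subspaces.
        penalty = penalty + (new_lora_A @ prev_A.t()).pow(2).sum()
    return penalty


# Usage: add the penalty to the unlearning loss for the current request.
prev = [torch.randn(8, 64) for _ in range(2)]   # adapters from earlier requests (frozen)
curr = torch.randn(8, 64, requires_grad=True)   # adapter being trained for the new request
loss = orthogonality_penalty(curr, prev)
loss.backward()
```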