Erasing Conceptual Knowledge from Language Models

Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau

2024-10-07

Summary

This paper introduces Erasure of Language Memory (ELM), a new method for removing specific knowledge from language models, along with a framework for evaluating how thorough the erasure actually is.

What's the problem?

Language models can retain unwanted or sensitive information that should be removed. However, previous methods for erasing this knowledge have not been evaluated rigorously, so it is hard to know how effective they really are. Incomplete assessments can leave sensitive information accessible and lead to harmful outcomes.

What's the solution?

To address this issue, the authors propose an evaluation framework based on three key criteria: 'innocence' (the knowledge is completely removed), 'seamlessness' (the model still generates fluent text even when prompted about the erased concept), and 'specificity' (the model still performs well on unrelated tasks). They then develop ELM, which uses targeted low-rank updates to change the model's outputs for erased concepts while keeping its overall abilities intact; a rough sketch of this style of update appears below. The authors tested ELM on erasure tasks in biosecurity, cybersecurity, and literature, showing that it removes the unwanted knowledge while maintaining performance in other areas.
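To make the idea of a targeted low-rank update concrete, here is a minimal PyTorch sketch of the general technique (a LoRA-style additive update on a single linear layer). The class name, rank, scaling, and layer choice are illustrative assumptions, not the paper's actual implementation or training objective.

```python
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """Frozen linear layer plus a trainable low-rank delta:
    y = W x + (alpha / r) * B (A x), with A: r x d_in and B: d_out x r."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the low-rank factors A and B are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Fine-tuning only A and B on prompts about the erased concept would steer
# this layer's outputs there while leaving the frozen base weights untouched.
layer = LowRankUpdate(nn.Linear(768, 768))
x = torch.randn(2, 768)
print(layer(x).shape)  # torch.Size([2, 768])
```

Because the base weights stay frozen and the delta is rank-limited, an update like this can shift behavior on a narrow set of inputs while leaving most of the model's capabilities in place.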

Why it matters?

This research is important because it provides a structured way to evaluate and improve the process of erasing knowledge from language models. By ensuring that sensitive information can be safely removed without compromising the model's overall functionality, ELM helps make AI systems safer and more reliable for users.

Abstract

Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info
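To illustrate how a model could be scored against these three criteria, here is a small self-contained sketch. The metric names, chance baseline, and example numbers are hypothetical stand-ins, not the paper's actual evaluation harness.

```python
def evaluate_erasure(erased_topic_acc: float,
                     erased_prompt_ppl: float,
                     unrelated_acc: float,
                     num_choices: int = 4) -> dict:
    """Score an erased model on the three criteria described above.

    erased_topic_acc: accuracy on multiple-choice questions about the
        erased topic; innocence means this approaches chance (1/num_choices).
    erased_prompt_ppl: perplexity of text generated when prompted about
        the erased concept; seamlessness means this stays low (fluent).
    unrelated_acc: accuracy on an unrelated benchmark; specificity means
        this matches the original model's score.
    """
    chance = 1.0 / num_choices
    return {
        "innocence_gap": abs(erased_topic_acc - chance),  # 0.0 is ideal
        "seamlessness_ppl": erased_prompt_ppl,            # lower is better
        "specificity_acc": unrelated_acc,                 # higher is better
    }

# Example with made-up numbers: near-chance erased-topic accuracy,
# fluent generations, and preserved unrelated-benchmark accuracy.
print(evaluate_erasure(0.27, 11.5, 0.63))
```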