CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
2025-08-25
Summary
This paper focuses on how to make large language models 'forget' specific harmful information they've learned, without messing up their ability to do everything else they're good at.
What's the problem?
Large language models can sometimes generate unsafe or unwanted responses because of things they learned during their training. Existing methods to fix this usually only apply while the model is running (at inference time), so anyone with access to the model's underlying parameters can simply switch those fixes off. It's like putting a temporary band-aid on a problem that someone can just rip off.
What's the solution?
The researchers developed a new technique called CRISP that permanently changes the model's internal parameters to remove the unwanted knowledge. It works by using sparse autoencoders to find the specific internal features responsible for that knowledge, across several layers of the model, and then suppressing their activity. This isn't a temporary fix; it actually alters how the model works at a fundamental level, making it much harder to reverse.
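To make the idea concrete, here is a minimal toy sketch of the general recipe, not the paper's actual implementation: score sparse-autoencoder features by how much more they fire on harmful ("forget") text than on benign ("retain") text, then define a loss that penalizes those features, which fine-tuning can minimize to bake the suppression into the weights. The encoder `sae_encode`, the scoring rule in `salient_features`, and `suppression_loss` are all simplified stand-ins for a real pretrained SAE and CRISP's actual objective.

```python
import torch

torch.manual_seed(0)
d_model, n_features = 16, 64

# Toy SAE encoder: in practice W_enc comes from a pretrained sparse
# autoencoder trained on the model's residual-stream activations.
W_enc = torch.randn(d_model, n_features)

def sae_encode(acts):
    # Map model activations to (non-negative) sparse feature activations.
    return torch.relu(acts @ W_enc)

def salient_features(forget_acts, retain_acts, top_k=8):
    # Score each feature by its mean activation gap between the harmful
    # ("forget") corpus and the benign ("retain") corpus, and keep the
    # top-k. The paper's actual saliency criterion may differ.
    gap = sae_encode(forget_acts).mean(0) - sae_encode(retain_acts).mean(0)
    return torch.topk(gap, top_k).indices

def suppression_loss(acts, feature_idx):
    # Penalize the selected features' activations. Minimizing this loss
    # during fine-tuning (rather than clamping at inference time) is what
    # makes the change persistent in the model's parameters.
    feats = sae_encode(acts)
    return feats[:, feature_idx].pow(2).mean()

# Stand-in activation batches for the two corpora.
forget = torch.randn(32, d_model)
retain = torch.randn(32, d_model)
idx = salient_features(forget, retain)
loss = suppression_loss(forget, idx)
```

In the real method, gradients of such a loss flow back into the model's own weights, so the suppression survives even if the SAE is removed afterwards.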
Why it matters?
This is important because as we rely more on these models for important tasks, we need to be able to trust them to be safe and reliable. Being able to permanently remove harmful knowledge, while still keeping the model useful, is a big step towards making that trust possible and protecting against malicious use.
Abstract
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.