
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, Mor Geva

2024-06-20


Summary

This paper introduces a new way to evaluate how well large language models (LLMs) can 'unlearn' harmful or unwanted information: instead of testing only the model's behavior, it examines changes in the model's internal parameters.

What's the problem?

Because LLMs are trained on vast amounts of data, they can pick up harmful or private information that later needs to be forgotten. Current evaluations of unlearning mainly test how the model behaves after the procedure. However, a model can pass these behavioral tests while the unwanted knowledge remains encoded in its parameters, where it can be adversarially exploited to recover the supposedly erased information.

What's the solution?

The researchers propose evaluating unlearning by inspecting the model's internal parameters, specifically 'parametric knowledge traces'. They develop a method for finding directions in parameter space, called 'concept vectors', that encode concrete concepts, and build a benchmark named ConceptVectors containing hundreds of common concepts and their knowledge traces in two open-source LLMs. Analyzing these concept vectors, they find that existing unlearning methods barely change the internal traces, indicating that the unwanted knowledge likely remains in the model. In contrast, directly ablating the concept vectors effectively erases the associated knowledge and makes the models less vulnerable to adversarial recovery.
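To make the two ideas concrete, here is a minimal sketch of what ablating a concept direction from a weight matrix and measuring a parametric shift along that direction might look like. This is an illustrative simplification, not the paper's actual implementation: the function names, the choice of a single matrix `W`, and the plain projection-removal step are all assumptions for the example.

```python
import numpy as np

def ablate_concept_vector(W, v):
    """Remove the component of each row of W that lies along direction v.

    Hypothetical sketch: projects the (normalized) concept direction v
    out of the weight matrix W, so W no longer writes along v.
    """
    v = v / np.linalg.norm(v)
    # Subtract each row's projection onto v: (W @ v) gives per-row
    # coefficients; the outer product rebuilds the component to remove.
    return W - np.outer(W @ v, v)

def parametric_shift(W_before, W_after, v):
    """How much an unlearning step moved the parameters along direction v.

    A near-zero value suggests the concept's parametric trace is intact,
    which is the failure mode the paper reports for behavioral methods.
    """
    v = v / np.linalg.norm(v)
    return float(np.linalg.norm((W_after - W_before) @ v))
```

After ablation, the matrix has no remaining component along the concept direction, so `ablate_concept_vector(W, v) @ v` is (numerically) zero, while `parametric_shift` between the original and ablated weights is large, which is the kind of internal signal a behavioral test would miss.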

Why it matters?

This research matters because it exposes a gap in how unlearning is evaluated. By checking internal parametric knowledge rather than just behavior, developers can verify that harmful information is truly removed rather than merely suppressed, which is crucial for privacy and safety in AI applications and for guarding against adversarial recovery of erased information.

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.