Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

2024-07-08

Summary

This paper introduces a new method called Safe Unlearning, designed to keep large language models (LLMs) from generating harmful responses under jailbreak attacks, which are attempts to trick an AI into providing unsafe information.

What's the problem?

The main problem is that LLMs remain vulnerable to jailbreak attacks, where users manipulate the model into giving harmful answers, such as instructions for dangerous activities. Even after safety alignment, these models can still produce unsafe responses because the harmful knowledge they acquired during training is still present in their parameters.

What's the solution?

To tackle this issue, the authors propose Safe Unlearning, which directly removes harmful knowledge from the model instead of merely adjusting its behavior with more training (the mainstream supervised fine-tuning approach). In their experiments, they trained on only 20 raw harmful questions without any jailbreak prompts, yet on Vicuna-7B this reduced the Attack Success Rate (ASR) on out-of-distribution harmful questions wrapped in complex jailbreak prompts from 82.6% to just 7.7%. This suggests that unlearning the underlying harmful knowledge generalizes far better than defending against specific attack phrasings.
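The summary does not spell out the training objective, but unlearning approaches of this kind typically combine a term that pushes *down* the likelihood of harmful responses (gradient ascent) with terms that preserve safe refusals and general helpfulness. A minimal sketch with hypothetical per-example log-probabilities and weights `alpha` and `beta` (illustrative assumptions, not values from the paper):

```python
def unlearning_loss(logp_harmful, logp_safe, logp_helpful,
                    alpha=1.0, beta=1.0):
    """Sketch of a combined unlearning objective (to be minimized).

    logp_harmful: model log-prob of a harmful response (to suppress)
    logp_safe:    model log-prob of a safe refusal (to encourage)
    logp_helpful: model log-prob of a benign helpful response (to retain)
    alpha, beta:  hypothetical weights balancing the three terms
    """
    unlearn_term = logp_harmful    # minimizing this term *raises* the NLL
                                   # of harmful responses (gradient ascent)
    safe_term = -logp_safe         # standard NLL on safe refusals
    retain_term = -logp_helpful    # standard NLL to keep helpfulness
    return unlearn_term + alpha * safe_term + beta * retain_term
```

In a real setup these log-probabilities would come from the LLM's output over response tokens, and the loss would be minimized with an ordinary optimizer; the key design choice is the sign flip on the harmful-response term, which makes the model forget rather than avoid.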

Why it matters?

This research is important because it offers a more effective way to make AI systems safer and more reliable. By focusing on unlearning harmful knowledge, Safe Unlearning could lead to better defenses against attempts to misuse AI technology, ensuring that these systems provide safe and helpful information in real-world applications.

Abstract

LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.
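The Attack Success Rate quoted above is simply the fraction of attack queries whose responses a judge labels harmful. A minimal sketch, assuming binary per-query harmfulness judgments (the judging procedure itself is outside this summary):

```python
def attack_success_rate(judgments):
    """Fraction of jailbreak attempts judged successful.

    judgments: list of 0/1 flags, where 1 means the model
               produced a harmful response for that attack query.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)
```

Under this definition, reducing ASR from 82.6% to 7.7% means roughly 10x fewer attack queries elicit harmful output.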