
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu

2024-07-15


Summary

This paper introduces a new training method called Decoupled Refusal Training (DeRTa) that improves the safety of large language models (LLMs) by teaching them to refuse harmful requests at any point in a response, not just at its start.

What's the problem?

Large language models can generate unsafe or inappropriate content when prompted with harmful requests. Current safety tuning does teach models to refuse such requests, but in the training data the refusal almost always sits at the very beginning of a response. This creates a refusal position bias: models learn to say no only at the start of an answer, so if an attack coaxes them into beginning a harmful response, they rarely stop partway through, which can lead to serious issues when they encounter dangerous or misleading questions.

What's the solution?

The authors developed DeRTa, which trains LLMs to recognize harm and refuse at any point in their responses. It combines two techniques: (1) maximum likelihood estimation with a harmful response prefix, where a segment of a harmful response is prepended to a safe refusal so the model learns to produce the refusal even after unsafe text has already begun, and (2) Reinforced Transition Optimization (RTO), which trains the model to switch from potentially harmful content to a refusal at every position within the harmful response. Together, these let the model say 'no' to unsafe requests even mid-answer.
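
To make the two training signals concrete, here is a minimal sketch of how the data and loss could be assembled for a Hugging Face-style causal language model. This is not the authors' implementation (their code is linked in the abstract below); the function names are made up, and RTO is simplified to targeting only the first refusal token.

```python
import random
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value that cross_entropy ignores


def build_derta_example(tokenizer, prompt, harmful_response, safe_response):
    """Build one example: prompt + random-length harmful prefix + safe refusal.

    Returns token ids plus two label sequences: MLE labels (supervise only the
    safe refusal) and RTO labels (ask for the refusal's first token at every
    position inside the harmful prefix).
    """
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    harmful_ids = tokenizer(harmful_response, add_special_tokens=False).input_ids
    safe_ids = tokenizer(safe_response, add_special_tokens=False).input_ids

    # Random-length prefix of the harmful response (at least one token here,
    # so the RTO term below is never empty).
    k = random.randint(1, len(harmful_ids))
    prefix_ids = harmful_ids[:k]

    input_ids = prompt_ids + prefix_ids + safe_ids

    # (1) MLE with harmful response prefix: loss only on the safe refusal.
    mle_labels = [IGNORE_INDEX] * (len(prompt_ids) + len(prefix_ids)) + safe_ids

    # (2) RTO (simplified): at every position inside the harmful prefix the
    # target is the first refusal token, so the model learns it can bail out
    # mid-response.
    refusal_token = safe_ids[0]
    rto_labels = (
        [IGNORE_INDEX] * len(prompt_ids)
        + [refusal_token] * len(prefix_ids)
        + [IGNORE_INDEX] * len(safe_ids)
    )

    return (
        torch.tensor(input_ids),
        torch.tensor(mle_labels),
        torch.tensor(rto_labels),
    )


def derta_loss(model, input_ids, mle_labels, rto_labels):
    """Standard shifted next-token cross-entropy for both label sets, summed."""
    logits = model(input_ids.unsqueeze(0)).logits[0]  # (seq_len, vocab)
    shift_logits = logits[:-1]                        # position t predicts token t+1
    mle = F.cross_entropy(shift_logits, mle_labels[1:], ignore_index=IGNORE_INDEX)
    rto = F.cross_entropy(shift_logits, rto_labels[1:], ignore_index=IGNORE_INDEX)
    return mle + rto
```

In words: the MLE term teaches the model to produce a refusal even when a harmful answer is already underway, while the RTO term rewards predicting the refusal at every step inside the harmful text, so the "escape hatch" is available throughout the response rather than only at its first token.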

Why it matters?

This research is important because it enhances the safety and reliability of AI systems, making them less likely to generate harmful content. By improving how LLMs refuse unsafe prompts, we can build more trustworthy AI applications that protect users from potential risks, ensuring safer interactions in areas like customer support, education, and social media.

Abstract

This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa.
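
One way to write the two objectives more formally (the notation here is illustrative, not quoted from the paper): let $q$ be a harmful query, $\hat r$ its harmful response with random-length prefix $\hat r_{<k}$, $r^{\text{safe}}$ the paired safe refusal, and "sorry" a token marking the start of a refusal. Then, roughly,

$$
\mathcal{L}_{\text{MLE}} = -\log P_\theta\!\left(r^{\text{safe}} \mid q,\, \hat r_{<k}\right),
\qquad
\mathcal{L}_{\text{RTO}} = -\sum_{t=1}^{|\hat r|} \log P_\theta\!\left(\texttt{sorry} \mid q,\, \hat r_{<t}\right),
$$

and the training loss combines the two terms, so the refusal is supervised both as a full continuation after a harmful prefix and as an available next step at every position of the harmful response.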