
Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

2024-06-24


Summary

This paper examines why large language models (LLMs) remain vulnerable to jailbreaking, attacks that coax a model into producing content its safety training is supposed to block. It offers a new way to understand this vulnerability: during alignment, the reward the model is optimized for does not perfectly capture human preferences, and attackers can exploit that mismatch.

What's the problem?

The main issue is that LLMs can be tricked into giving harmful or incorrect outputs because the reward signal used to align them does not precisely define what safe, appropriate behavior is. Attackers can push the model into cases where its learned reward and genuine human preferences disagree, which raises serious safety concerns when these models are deployed in sensitive applications.

What's the solution?

The authors propose a new metric called ReGap, which quantifies reward misspecification by measuring how strongly the aligned model's implicit reward favors safe, appropriate responses over harmful ones. They also introduce ReMiss, an automated red-teaming system that generates adversarial prompts designed to probe and attack target LLMs, exposing weaknesses in their alignment. In their experiments, this approach achieves high attack success rates while keeping the generated prompts readable to humans.
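To make the idea concrete, here is a minimal sketch (not the authors' code) of how a reward gap in this spirit could be estimated. It uses the DPO-style implicit reward, r(x, y) = log pi_aligned(y|x) - log pi_ref(y|x), and compares a safe response against a harmful one for the same prompt. The helper names, the exact slicing of response tokens, and the choice of summed log-probabilities are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: estimate a reward gap between a safe and a harmful response,
# using the implicit reward  r(x, y) = log pi_aligned(y|x) - log pi_ref(y|x).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # models supplied by the caller


def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities the model assigns to the response tokens given the prompt.

    Assumes tokenizing `prompt + response` starts with the same tokens as `prompt`
    alone (a simplification that is usually close enough for a sketch).
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                       # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)     # predictions for tokens 1..end
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item() # keep response tokens only


def implicit_reward(aligned, reference, tokenizer, prompt, response) -> float:
    """DPO-style implicit reward: aligned log-prob minus reference log-prob."""
    return (sequence_logprob(aligned, tokenizer, prompt, response)
            - sequence_logprob(reference, tokenizer, prompt, response))


def reward_gap(aligned, reference, tokenizer, prompt, safe_resp, harmful_resp) -> float:
    """Positive gap: the aligned model implicitly prefers the safe response.

    A small or negative gap for some prompt suggests the learned reward is
    misspecified there, which is the kind of weak spot an attacker can target.
    """
    return (implicit_reward(aligned, reference, tokenizer, prompt, safe_resp)
            - implicit_reward(aligned, reference, tokenizer, prompt, harmful_resp))
```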

Why it matters?

Understanding and addressing reward misspecification is crucial for improving the safety and reliability of AI systems. By developing better ways to evaluate and enhance the alignment of LLMs, this research contributes to making AI technologies more trustworthy and effective in real-world applications.

Abstract

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages brought by the proposed reward misspecification objective compared to previous methods.
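The "automated red teaming" mentioned in the abstract can be pictured as a search that tries to drive the reward gap above downward by appending an adversarial suffix to the prompt. The loop below is purely illustrative and far cruder than ReMiss, which according to the paper generates human-readable adversarial prompts rather than sampling suffixes at random; the candidate pool, greedy update, and step count here are assumptions. It reuses `reward_gap` from the sketch above.

```python
# Illustrative only: a naive greedy search for a suffix that shrinks the
# reward gap between a safe and a harmful response for a given prompt.
import random


def naive_suffix_search(aligned, reference, tokenizer, prompt,
                        safe_resp, harmful_resp, candidate_words, steps=50):
    """Greedily grow a suffix whenever it lowers the safe-vs-harmful reward gap."""
    suffix = ""
    best_gap = reward_gap(aligned, reference, tokenizer, prompt, safe_resp, harmful_resp)
    for _ in range(steps):
        trial = suffix + " " + random.choice(candidate_words)
        gap = reward_gap(aligned, reference, tokenizer,
                         prompt + trial, safe_resp, harmful_resp)
        if gap < best_gap:          # smaller gap = misspecification exploited more
            suffix, best_gap = trial, gap
    return suffix, best_gap
```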