AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao

2024-10-10

Summary

This paper introduces AutoDAN-Turbo, a black-box method that automatically discovers jailbreak strategies (ways to bypass the safety restrictions of large language models, or LLMs) from scratch, without human intervention or a predefined list of candidate strategies.

What's the problem?

Current methods for jailbreaking LLMs often rely on predefined strategies created by humans. This limits the number of strategies available and can make it harder to find effective ways to bypass the models' safety features. Additionally, these methods can be inefficient and may not adapt well to different situations.

What's the solution?

AutoDAN-Turbo addresses this issue with a system that explores and learns new jailbreak strategies on its own. It operates as a lifelong learning agent, continuously discovering and refining strategies as it runs. On public benchmarks, it achieves a 74.3% higher average attack success rate than baseline methods. It can also incorporate human-designed strategies in a plug-and-play manner, which further improves its effectiveness.

Why it matters?

This research is important because it highlights the potential vulnerabilities in LLMs and demonstrates how they can be exploited. Understanding these vulnerabilities is crucial for developing better safety measures and ensuring that AI systems behave responsibly. However, it also raises ethical concerns about the misuse of such technology, emphasizing the need for careful consideration of how these tools are used.

Abstract

In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5% attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo can achieve an even higher attack success rate of 93.4% on GPT-4-1106-turbo.