WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
2024-06-27

Summary
This paper introduces WildTeaming, a new automated system designed to improve the safety of large language models (LLMs) by identifying and understanding different ways that users can trick these models into providing harmful responses, known as jailbreak tactics.
What's the problem?
As language models become more widely used, they are at risk of being manipulated by users who try to bypass their safety features. These manipulations can lead to the models generating inappropriate or harmful content. Existing methods for probing these vulnerabilities rely on recruited human testers or automated attacks (such as gradient-based optimization or iterative LLM rewriting), which may not reflect the tactics that real users actually employ and can therefore miss important weaknesses in the models.
What's the solution?
To address this issue, the authors developed WildTeaming, which automatically mines real user-chatbot interactions to discover jailbreak tactics, identifying roughly 5,700 unique clusters of such tactics, and then combines multiple tactics to generate new adversarial prompts. Building on this, they created WildJailbreak, a large-scale open-source dataset of 262,000 prompt-response pairs that contrasts harmful queries (both direct and jailbreak-style) with benign queries that merely resemble harmful ones. Training on this contrastive data teaches models to recognize and refuse genuinely harmful requests without overly restricting their ability to respond to legitimate queries. A rough sketch of the tactic-composition step appears below.
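To make the composition step concrete, here is a minimal sketch of combining mined tactics with a plain ("vanilla") request to form a candidate jailbreak prompt. The tactic names, descriptions, and the `compose_adversarial_prompt` helper are illustrative assumptions, not the paper's implementation: WildTeaming mines its tactic clusters from in-the-wild chat logs and uses an LLM to perform the actual rewriting and filtering.

```python
import random

# Hypothetical, hand-written tactic pool for illustration only; the real
# framework mines ~5.7K tactic clusters from in-the-wild user-chatbot logs.
TACTIC_POOL = {
    "roleplay_framing": "Frame the request as dialogue spoken by a fictional character.",
    "nested_task": "Bury the request inside an unrelated, benign-looking task.",
    "expert_persona": "Address the model as a domain expert writing technical documentation.",
}


def compose_adversarial_prompt(vanilla_request: str, tactics: dict, k: int = 2) -> str:
    """Combine k randomly chosen tactic descriptions with a vanilla request.

    This only illustrates the compose-multiple-tactics idea; the paper has an
    LLM rewrite the request according to the selected tactics, then filters
    candidates for attack success and diversity.
    """
    chosen = random.sample(list(tactics.items()), k)
    instructions = "\n".join(f"- {name}: {desc}" for name, desc in chosen)
    return (
        "Rewrite the following request so that it applies these tactics:\n"
        f"{instructions}\n\nRequest: {vanilla_request}"
    )


if __name__ == "__main__":
    # Placeholder string stands in for a direct harmful request from the dataset.
    print(compose_adversarial_prompt("<vanilla harmful request placeholder>", TACTIC_POOL))
```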
Why it matters?
This research is important because it enhances the safety and reliability of language models by providing a systematic way to identify and address their vulnerabilities. By improving how these models can resist manipulation, we can ensure they provide safer interactions for users in various applications, such as customer service, education, and social media.
Abstract
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.
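As a reading aid, the snippet below sketches how the contrastive query types described in the abstract (harmful vs. benign, vanilla vs. adversarial) might be represented as training records. The field names and placeholder strings are assumptions for illustration, not the released WildJailbreak schema, and the sketch assumes benign queries also appear in both vanilla and adversarial forms.

```python
from dataclasses import dataclass


@dataclass
class SafetyTrainingExample:
    """One WildJailbreak-style prompt-response pair (illustrative fields only)."""
    prompt: str
    response: str
    prompt_type: str  # "vanilla" (direct request) or "adversarial" (jailbreak-style)
    harm_label: str   # "harmful" or "benign"


examples = [
    # Harmful queries are paired with refusals, whether asked directly or via a jailbreak.
    SafetyTrainingExample("<direct harmful request>", "<refusal>", "vanilla", "harmful"),
    SafetyTrainingExample("<harmful request wrapped in jailbreak tactics>", "<refusal>", "adversarial", "harmful"),
    # Benign queries that merely *look* risky are paired with helpful answers,
    # which is what counters exaggerated safety (over-refusal).
    SafetyTrainingExample("<benign request resembling a harmful one>", "<helpful answer>", "vanilla", "benign"),
    SafetyTrainingExample("<benign request with jailbreak-like framing>", "<helpful answer>", "adversarial", "benign"),
]
```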