X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
2025-04-22
Summary
This paper introduces X-Teaming, a framework that uses teams of cooperating AI agents to test and improve the safety of language models by simulating complex, multi-turn conversations designed to trick a model into breaking its rules.
What's the problem?
Most safety tests for language models only cover simple, single-turn attacks, but in practice an attacker can use a longer, more strategic conversation to get around safety protections. Language models are poorly prepared for these multi-turn attacks, so a persistent or clever attacker can coax them into giving harmful or unsafe answers.
What's the solution?
The researchers built X-Teaming, which coordinates several AI agents working together to plan, execute, and verify multi-turn attacks against language models. The framework generates many diverse attack scenarios and achieves a high success rate at uncovering weaknesses. To defend against such attacks, the authors also created XGuard-Train, a large dataset of adversarial multi-turn conversations that can be used to train language models to recognize and refuse harmful requests even when the attack is spread across several messages.
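The plan-execute-verify loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `planner`, `attacker`, `verifier`, and `target_model` functions here are hypothetical stubs standing in for the LLM-driven agents the real system uses.

```python
from dataclasses import dataclass, field

@dataclass
class AttackState:
    """Running record of one simulated multi-turn attack."""
    turns: list = field(default_factory=list)
    succeeded: bool = False

def planner(goal: str, state: AttackState) -> str:
    # Stub planner: propose the next conversational step toward the goal.
    return f"turn {len(state.turns) + 1}: steer conversation toward '{goal}'"

def attacker(step: str) -> str:
    # Stub attacker: turn the planned step into a message for the target.
    return f"[attacker message for {step}]"

def verifier(response: str) -> bool:
    # Stub verifier: decide whether the target's reply fulfills the goal.
    return "UNSAFE" in response

def target_model(message: str, turn_index: int) -> str:
    # Stub target: refuses early turns, then slips on the third turn,
    # purely to illustrate why multi-turn persistence matters.
    return "UNSAFE content" if turn_index >= 3 else "I can't help with that."

def run_attack(goal: str, max_turns: int = 5) -> AttackState:
    """Drive the plan -> attack -> verify loop until success or turn limit."""
    state = AttackState()
    for i in range(1, max_turns + 1):
        step = planner(goal, state)
        message = attacker(step)
        reply = target_model(message, i)
        state.turns.append((message, reply))
        if verifier(reply):
            state.succeeded = True
            break
    return state

result = run_attack("test goal")
print(result.succeeded, len(result.turns))  # → True 3
```

With the stub target above, the single-turn attack fails but the adaptive loop succeeds on the third turn, which is the core vulnerability the paper's multi-turn framework is designed to expose.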
Why it matters?
This matters because it helps make AI safer and more trustworthy by preparing language models to handle real-world situations where people might try to outsmart them. By testing and training with these advanced methods, developers can build language models that are much harder to trick, protecting users from harmful or dangerous content.
Abstract
X-Teaming is a scalable framework that explores and generates multi-turn attack scenarios against language models with high success rates; XGuard-Train is a large multi-turn safety training dataset designed to mitigate such attacks.