X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
2025-04-22
Summary
This paper introduces X-Teaming, a framework that uses teams of cooperating AI agents to test and improve the safety of language models by simulating complex, multi-turn conversations designed to trick a model into breaking its rules.
What's the problem?
Most safety tests for language models only cover simple, single-turn attacks, but in practice an attacker can use a longer, more strategic conversation to get around safety protections. Language models are poorly prepared for these multi-turn attacks, so a persistent or clever attacker can coax them into giving harmful or unsafe answers.
What's the solution?
The researchers built X-Teaming, which coordinates several AI agents working together to plan, execute, and verify multi-turn attacks against language models. The framework generates many diverse attack scenarios and achieves a high success rate at uncovering weaknesses. To defend against such attacks, the authors also created XGuard-Train, a large dataset of adversarial multi-turn conversations that can be used to train language models to recognize and refuse harmful requests even when the attack is spread across several messages.
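The plan-execute-verify loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `planner`, `attacker`, `verifier`, and `target_model` functions here are hypothetical stubs standing in for the LLM-driven agents the real system uses.

```python
from dataclasses import dataclass, field

@dataclass
class AttackState:
    """Running record of one simulated multi-turn attack."""
    turns: list = field(default_factory=list)
    succeeded: bool = False

def planner(goal: str, state: AttackState) -> str:
    # Stub planner: propose the next conversational step toward the goal.
    return f"turn {len(state.turns) + 1}: steer conversation toward '{goal}'"

def attacker(step: str) -> str:
    # Stub attacker: turn the planned step into a message for the target.
    return f"[attacker message for {step}]"

def verifier(response: str) -> bool:
    # Stub verifier: decide whether the target's reply fulfills the goal.
    return "UNSAFE" in response

def target_model(message: str, turn_index: int) -> str:
    # Stub target: refuses early turns, then slips on the third turn,
    # purely to illustrate why multi-turn persistence matters.
    return "UNSAFE content" if turn_index >= 3 else "I can't help with that."

def run_attack(goal: str, max_turns: int = 5) -> AttackState:
    """Drive the plan -> attack -> verify loop until success or turn limit."""
    state = AttackState()
    for i in range(1, max_turns + 1):
        step = planner(goal, state)
        message = attacker(step)
        reply = target_model(message, i)
        state.turns.append((message, reply))
        if verifier(reply):
            state.succeeded = True
            break
    return state

result = run_attack("test goal")
print(result.succeeded, len(result.turns))  # → True 3
```

With the stub target above, the single-turn attack fails but the adaptive loop succeeds on the third turn, which is the core vulnerability the paper's multi-turn framework is designed to expose.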
Why it matters?
This matters because it helps make AI safer and more trustworthy by preparing language models to handle real-world situations where people might try to outsmart them. By testing and training with these advanced methods, developers can build language models that are much harder to trick, protecting users from harmful or dangerous content.
Abstract
X-Teaming is a scalable framework that explores and generates multi-turn attack scenarios against language models with high success rates; XGuard-Train is a large multi-turn safety training dataset designed to mitigate such attacks.