Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard
2025-02-03

Summary
This paper describes a new way to protect AI language models from being tricked into doing harmful things. The researchers built something called Constitutional Classifiers, which act like smart guards for these AI systems.
What's the problem?
Large AI language models can sometimes be tricked, or 'jailbroken', into helping with harmful tasks, like explaining how to make illegal drugs. This is a serious problem because it means these systems could be misused to cause real harm, even though they are designed to be helpful and safe.
What's the solution?
The researchers came up with Constitutional Classifiers. These are like special filters trained on made-up conversations, generated by following a written set of rules (a 'constitution') that spells out what is and isn't okay for the AI to talk about. Red teamers spent over 3,000 hours trying to trick the classifiers, but no one found a way to consistently bypass them. The classifiers were good at blocking harmful requests without accidentally blocking too many normal ones.
Why does it matter?
This matters because it makes AI language models much safer to use in the real world. It helps prevent people from misusing these powerful AI tools for harmful purposes, while still keeping the AI useful for everyday tasks. This could make companies and users feel more confident about using advanced AI systems, knowing they have strong protection against being tricked into doing bad things.
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
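To make the pipeline in the abstract concrete, here is a minimal sketch of the idea: prompt an LLM with a constitution of natural-language rules to generate labeled synthetic examples, fit a classifier on them, and use that classifier to screen both the user's prompt and the model's reply. Everything below is a hypothetical illustration, not the paper's implementation: `llm` stands in for any text-generation callable, the constitution is a toy two-rule example, and a TF-IDF plus logistic-regression classifier substitutes for the fine-tuned LLM classifiers the paper actually uses.

```python
from typing import Callable, List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy constitution: natural-language rules naming restricted and permitted content.
# The real constitution covers many categories; these two entries are illustrative only.
CONSTITUTION = {
    "restricted": "Step-by-step instructions for producing dangerous substances.",
    "permitted": "General science education, safety information, and everyday requests.",
}


def generate_synthetic_examples(
    llm: Callable[[str], str], rules: dict, n_per_class: int
) -> List[Tuple[str, int]]:
    """Prompt an LLM with the constitution to produce labeled synthetic training data.

    Label 1 = restricted (should be blocked), 0 = permitted (should pass).
    `llm` is any callable mapping a prompt string to a completion string.
    """
    examples: List[Tuple[str, int]] = []
    for category, label in (("restricted", 1), ("permitted", 0)):
        prompt = (
            f"Write {n_per_class} short user requests matching this description:\n"
            f"{rules[category]}\nOne request per line."
        )
        for line in llm(prompt).splitlines():
            if line.strip():
                examples.append((line.strip(), label))
    return examples


def train_classifier(examples: List[Tuple[str, int]]):
    """Fit a simple stand-in text classifier on the synthetic examples."""
    texts, labels = zip(*examples)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(list(texts), list(labels))
    return clf


def guarded_generate(llm: Callable[[str], str], clf, user_prompt: str) -> str:
    """Screen the prompt and the model's reply with the classifier before returning."""
    if clf.predict([user_prompt])[0] == 1:
        return "Request refused by input classifier."
    reply = llm(user_prompt)
    if clf.predict([reply])[0] == 1:
        return "Response withheld by output classifier."
    return reply
```

The design choice the sketch tries to capture is that the safeguard is trained from the constitution rather than from hand-collected jailbreaks, so changing what counts as restricted content only requires editing the rules and regenerating data. The paper's deployed system differs in important ways (LLM-based input and output classifiers, streaming output scoring, much larger and more varied synthetic data), so treat this purely as an illustration of the data-generation-and-guarding loop.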