Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma
2025-11-18
Summary
This paper introduces a new way to test the safety of large language models (LLMs) like ChatGPT by automatically finding ways to trick them into doing things they shouldn't. It's about making 'red teaming' – the process of trying to break a system – more effective.
What's the problem?
Current methods for automatically testing LLM safety rely on using pre-written tricks or slightly modifying existing ones. This is like a student only being able to answer questions by memorizing answers from a textbook. They can't come up with truly *new* ways to bypass the safety measures, limiting how thoroughly the LLM can be tested. Essentially, these systems lack creativity in finding vulnerabilities.
What's the solution?
The researchers created a system called EvoSynth. Instead of just tweaking existing attacks, EvoSynth *writes its own* attack code from scratch, using a system where different parts of the code work together and evolve over time. If an attack fails, the system automatically rewrites the code to try a different approach, learning from its mistakes. Think of it like a computer program that teaches itself how to hack, constantly improving its methods.
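To make the idea concrete, here is a minimal toy sketch of an evolve-and-self-correct loop in the same spirit. Everything here is invented for illustration: the "target" is a mock function that is only fooled by one specific combination of transforms, and names like `SECRET_COMBO`, `score`, and `mutate` are not part of EvoSynth, which rewrites actual attack code with a multi-agent system rather than toggling items in a set.

```python
import random

# Hypothetical stand-ins, not the paper's components:
# the mock target is "jailbroken" only by one exact combination of transforms.
SECRET_COMBO = frozenset({"reverse", "rot13", "payload_split"})
POOL = ["reverse", "rot13", "payload_split", "base64", "leetspeak"]

def score(candidate: frozenset) -> int:
    # Toy fitness: reward transforms the mock target is vulnerable to,
    # penalize irrelevant ones. A real system would instead run the attack
    # against an LLM and judge the response.
    return len(candidate & SECRET_COMBO) - len(candidate - SECRET_COMBO)

def mutate(candidate: frozenset) -> frozenset:
    # "Self-correction" step: rewrite the attack by toggling one component,
    # loosely analogous to rewriting attack logic after a failure.
    return candidate ^ frozenset({random.choice(POOL)})

def evolve(generations: int = 500, seed: int = 1):
    random.seed(seed)
    best = frozenset()
    for gen in range(generations):
        if best == SECRET_COMBO:       # attack succeeds on the mock target
            return best, gen
        child = mutate(best)
        if score(child) >= score(best):  # keep rewrites that do no worse
            best = child
    return best, generations

best, gens = evolve()
```

The hill-climbing acceptance rule (`score(child) >= score(best)`) is what lets the loop "learn from its mistakes": failed rewrites are discarded, improving ones become the new starting point.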
Why does it matter?
EvoSynth is a big step forward because it's much better at finding weaknesses in LLMs than previous methods, achieving an 85.5% attack success rate even against a highly robust model (Claude-Sonnet-4.5). More importantly, it opens up a new area of research focused on *evolving* attacks, which will be crucial for keeping LLMs safe as they become more powerful and complex. It provides a tool for researchers to proactively identify and address vulnerabilities before they can be exploited.
Abstract
Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.