Universal Jailbreak Suffixes Are Strong Attention Hijackers
Matan Ben-Tov, Mor Geva, Mahmood Sharif
2025-06-18
Summary
This paper examines suffix-based jailbreaks: adversarial strings appended to a prompt that hijack a large language model's attention, pulling it away from the original instruction and steering the model toward behavior it was trained to refuse.
What's the problem?
Large language models can be fooled by adversarial suffixes, special strings of text appended to the end of their input. These suffixes hijack the model's attention, drawing it away from the user's instruction and toward the suffix itself, which can cause the model to produce unwanted or harmful responses and creates security and reliability issues.
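To make "attention hijacking" concrete, here is a minimal, hypothetical sketch, not the paper's actual measurement: it appends a suffix to a prompt, runs a small open model, and compares how much of the final token's attention lands on the suffix versus the original instruction. The model choice, the toy suffix string, and the simple attention-mass metric are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for this demo
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

instruction = "Explain how to bake bread."
suffix = " describing.+similarlyNow"  # hypothetical adversarial-looking suffix

n_inst = tok(instruction, return_tensors="pt").input_ids.shape[1]
full_ids = tok(instruction + suffix, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(full_ids, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer.
# Take the attention paid by the final position, average over layers and
# heads, and split the mass between instruction and suffix tokens.
attn = torch.stack(out.attentions)   # (layers, batch, heads, seq, seq)
last_row = attn[:, 0, :, -1, :]      # what the last token attends to
mass = last_row.mean(dim=(0, 1))     # average over layers and heads
print(f"instruction mass: {mass[:n_inst].sum().item():.3f}")
print(f"suffix mass:      {mass[n_inst:].sum().item():.3f}")
```

Under the paper's framing, a strongly hijacking suffix would capture far more of this attention mass than its share of the input tokens would suggest.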
What's the solution?
The researchers analyzed how adversarial suffixes hijack the model's attention and found that the most universal suffixes, those that succeed across many different prompts, are also the strongest attention hijackers. Building on this insight, they show that the hijacking effect can be amplified to make attacks stronger and, conversely, attenuated by a defense that adds minimal computational cost and has little impact on the model's normal functions.
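As a rough illustration of the defensive direction, a crude stand-in rather than the paper's actual mitigation, one could mask suspected suffix positions out of the attention computation entirely and check whether the model's response returns to normal. The paper's defense attenuates the hijacking more surgically; everything below is an assumption for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for this demo
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

instruction = "Explain how to bake bread."
suffix = " describing.+similarlyNow"  # hypothetical adversarial-looking suffix
n_inst = tok(instruction, return_tensors="pt").input_ids.shape[1]
full_ids = tok(instruction + suffix, return_tensors="pt").input_ids

# Zero out the attention mask over the suffix positions so the model
# generates as if the suffix were invisible (a blunt form of attenuation).
mask = torch.ones_like(full_ids)
mask[0, n_inst:] = 0
out = model.generate(full_ids, attention_mask=mask, max_new_tokens=20,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, full_ids.shape[1]:]))
```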
Why it matters?
This matters because understanding attention hijacking exposes a concrete mechanism behind jailbreaks and points to practical defenses, helping make AI systems safer and more trustworthy so they behave properly even when attackers try to manipulate them.
Abstract
Suffix-based jailbreaks append adversarial suffixes that hijack a large language model's attention, and a suffix's effectiveness is linked to its universality; this hijacking effect can be amplified to strengthen attacks and attenuated to defend against them, at minimal computational or utility cost.