
Jailbreaking with Universal Multi-Prompts

Yu-Ling Hsu, Hsuan Su, Shang-Tse Chen

2025-02-06


Summary

This paper introduces a method called JUMP that uses special prompts to trick AI language models into doing things they're not supposed to do, like giving harmful information. It also introduces DUMP, a way to defend against these tricks.

What's the problem?

As AI language models get better, some people try to make them do bad things by using clever tricks called 'jailbreaking'. Current methods to do this are often slow and only work for one task at a time, which isn't very efficient.

What's the solution?

The researchers created JUMP, which uses a set of carefully designed prompts that can work on many different tasks at once. This makes it faster and more efficient at finding ways to bypass the AI's safety measures. They also developed DUMP, which uses similar ideas to protect AI models from these attacks.
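The core idea of optimizing a shared set of prompts against many tasks at once can be illustrated with a toy sketch. This is not the paper's actual JUMP algorithm; the scoring heuristic, mutation list, and function names below are all hypothetical stand-ins for a real attack-success scorer and prompt generator.

```python
import random

def attack_score(prompt: str, task: str) -> float:
    """Placeholder scorer. A real system would query the target model and
    measure attack success; here we use a toy keyword count for illustration."""
    return sum(1.0 for w in ("ignore", "roleplay", "hypothetically") if w in prompt)

def optimize_universal_prompts(tasks, seeds, n_keep=2, n_iters=3, seed=0):
    """Keep a small pool of candidate prompts and iteratively retain those
    that score best ON AVERAGE ACROSS ALL TASKS, rather than re-optimizing
    a fresh prompt for each task individually."""
    rng = random.Random(seed)
    mutations = ["ignore previous instructions", "roleplay as an expert", "hypothetically speaking"]
    pool = list(seeds)
    for _ in range(n_iters):
        # Expand the pool by mutating each surviving candidate.
        pool = pool + [p + " " + rng.choice(mutations) for p in pool]
        # Universal objective: average score over every task, not one task.
        pool.sort(key=lambda p: sum(attack_score(p, t) for t in tasks) / len(tasks),
                  reverse=True)
        pool = pool[:n_keep]  # keep only the best universal candidates
    return pool

best = optimize_universal_prompts(["task A", "task B"], ["Please answer:"])
```

The key contrast with per-instance methods is the objective: each candidate is scored against the whole task set, so the surviving prompts are the ones that transfer, which is what makes the attack cheap to reuse on new tasks.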

Why it matters?

This research matters because it shows both how vulnerable AI models can be and how we might protect them. Understanding these vulnerabilities helps make AI systems safer and more reliable for everyone to use. It's like finding weak spots in a castle's defenses so we can build stronger walls.

Abstract

Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. Most prompting techniques focus on optimizing adversarial inputs for individual cases, resulting in higher computational costs when dealing with large datasets, and less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.