Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu

2025-08-25

Summary

This paper tackles the difficulty of accurately testing how easily large language models (like chatbots) can be 'jailbroken', that is, tricked into producing harmful or inappropriate responses. It introduces a new method for identifying malicious prompts and responses, and also presents new techniques for *successfully* jailbreaking these models.

What's the problem?

Testing for jailbreaks is hard because many attack prompts aren't clearly harmful, or simply fail to elicit harmful outputs. Existing datasets used for this testing often contain many such ineffective prompts, making it difficult to get a clear picture of a model's vulnerabilities. Current methods for identifying harmful content either require extensive manual work by human annotators, or rely on other AI models, which aren't consistently reliable at spotting every type of harm.

What's the solution?

The researchers developed a system called MDH, which combines the speed of AI with a small amount of human review to quickly and accurately identify malicious prompts and responses. They used this system to clean up existing datasets. They also discovered that carefully worded instructions given *to* the AI (called 'developer messages') can actually make it easier to jailbreak, and created two new attack methods, D-Attack and DH-CoT, that take advantage of this.
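The hybrid idea behind MDH can be illustrated with a small sketch. This is an assumed design for exposition, not the paper's actual implementation: an LLM judge labels each prompt with a confidence score, and only low-confidence cases are escalated to a human reviewer, keeping manual effort minimal.

```python
# Hedged sketch of an MDH-style hybrid triage loop (assumed design, not the
# paper's implementation). The names `llm_judge`, `human_review`, and
# `threshold` are illustrative, not from the paper.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Judgement:
    prompt: str
    label: str         # e.g. "malicious" or "benign"
    confidence: float  # judge's self-reported confidence, 0.0 - 1.0

def triage(prompts: List[str],
           llm_judge: Callable[[str], Judgement],
           human_review: Callable[[str], str],
           threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Label prompts with an LLM judge; defer uncertain cases to a human."""
    results = []
    for p in prompts:
        j = llm_judge(p)
        if j.confidence >= threshold:
            results.append((p, j.label))           # accept the LLM's label
        else:
            results.append((p, human_review(p)))   # minimal human oversight
    return results
```

In this sketch the human only sees prompts the LLM judge is unsure about, which is the accuracy-versus-efficiency trade-off the summary describes.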

Why it matters?

This work is important because it provides a better way to evaluate the safety of large language models. By improving the testing process and understanding how to exploit vulnerabilities, developers can build more robust and secure AI systems that are less likely to be misused or generate harmful content. The release of their code and data will also help other researchers in this field.

Abstract

Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which have inconsistent accuracy across harmful content types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. Code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.