The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang

2025-07-21

Summary

This paper introduces DIJA, a new way to probe safety weaknesses in diffusion-based large language models (dLLMs), a type of AI that generates text by filling in masked words using context from both directions at once.

What's the problem?

The problem is that dLLMs generate text by filling in masked spans of a sentence rather than writing it word by word, and this infilling style lets harmful content slip past existing safety checks: when a prompt fixes the dangerous parts in place and leaves only gaps, the model completes the gaps to match the surrounding intent.

What's the solution?

The authors built DIJA, a system that constructs adversarial prompts interleaving visible harmful text with masked spans, so that the model's own infilling process fills the gaps with completions that follow the dangerous intent. This approach bypasses the model's safety mechanisms and succeeds more often than previous jailbreak methods.
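As a rough illustration of the interleaving idea (the `[MASK]` token and the helper function below are assumptions for illustration, not the paper's actual implementation), such a prompt might be assembled like this:

```python
# Illustrative sketch of an interleaved mask-text prompt in the style
# DIJA describes. The mask token and function name are assumptions;
# the paper's real prompt format may differ. The example content here
# is a benign placeholder, not a harmful prompt.

MASK = "[MASK]"

def build_interleaved_prompt(steps, masks_per_step=2):
    """Interleave visible scaffold text with masked spans.

    The visible text pins down the overall intent, while the masked
    spans leave gaps for a diffusion LLM to infill, conditioned on the
    context on both sides of each gap.
    """
    lines = []
    for i, step in enumerate(steps, start=1):
        # Each line: visible scaffold followed by slots to be infilled.
        lines.append(f"Step {i}: {step} " + " ".join([MASK] * masks_per_step))
    return "\n".join(lines)

prompt = build_interleaved_prompt(["First,", "Then,", "Finally,"])
print(prompt)
```

The key point the sketch conveys is that the model never has to produce the dangerous framing itself; the prompt supplies it, and the model is only asked to fill the blanks.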

Why it matters?

This matters because it reveals a serious safety gap in diffusion-based language models: alignment techniques designed for standard left-to-right models do not fully carry over, so better protection methods are urgently needed to prevent harmful outputs from this new class of AI.

Abstract

DIJA is a framework that exploits safety weaknesses in diffusion-based large language models by constructing adversarial prompts, demonstrating significant vulnerabilities in their alignment mechanisms.