JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model

Yi Nian, Shenzhe Zhu, Yuehan Qin, Li Li, Ziyi Wang, Chaowei Xiao, Yue Zhao

2025-04-08


Summary

This paper introduces JailDAM, a safety system that spots sneaky attempts to trick AI image-text models into producing harmful content, such as violent or hateful messages, without needing to see examples of harmful content beforehand.

What's the problem?

Current safety checks for AI image-text systems either need access to the model's inner workings (which companies often keep hidden), run too slowly for real-time detection, or require large sets of pre-labeled harmful examples that are hard to collect.

What's the solution?

JailDAM uses a memory system that learns what 'unsafe' looks like from policy descriptions rather than from real harmful examples, then updates that knowledge during use so it can catch new attack methods quickly without slowing detection down.
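The memory-and-update idea can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the embedding dimension, slot count, similarity threshold, and the use of random vectors as stand-ins for policy-derived "unsafe concept" embeddings are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory bank of "unsafe concept" embeddings. In JailDAM these
# would be distilled from a safety policy; here they are random placeholders.
DIM, SLOTS = 64, 8
memory = rng.normal(size=(SLOTS, DIM))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)

def score(x, memory):
    """Max cosine similarity between an input embedding and the memory.
    A high score suggests the input resembles known unsafe concepts."""
    x = x / np.linalg.norm(x)
    return float(np.max(memory @ x))

def maybe_update(x, memory, threshold=0.9):
    """Test-time adaptation (illustrative): if an input is flagged as unsafe
    but matches the memory poorly, overwrite the weakest slot with its
    embedding so similar future attacks are recognized immediately."""
    x = x / np.linalg.norm(x)
    sims = memory @ x
    if sims.max() < threshold:
        memory[int(np.argmin(sims))] = x
    return memory
```

The key design point this sketch captures is that detection stays cheap (one matrix-vector product per input) while the memory itself keeps adapting at test time, which is how the framework generalizes to unseen jailbreak strategies.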

Why it matters?

This helps keep AI chatbots and image-text models safer in real-world use, for example in social media filters that block harmful content faster and more reliably, without requiring access to companies' proprietary model internals.

Abstract

Multimodal large language models (MLLMs) excel in vision-language tasks but also pose significant risks of generating harmful content, particularly through jailbreak attacks. Jailbreak attacks refer to intentional manipulations that bypass safety mechanisms in models, leading to the generation of inappropriate or unsafe content. Detecting such attacks is critical to ensuring the responsible deployment of MLLMs. Existing jailbreak detection methods face three primary challenges: (1) many rely on model hidden states or gradients, limiting their applicability to white-box models, where the internal workings of the model are accessible; (2) they involve high computational overhead from uncertainty-based analysis, limiting real-time detection; and (3) they require fully labeled harmful datasets, which are often scarce in real-world settings. To address these issues, we introduce a test-time adaptive framework called JAILDAM. Our method leverages a memory-based approach guided by policy-driven unsafe knowledge representations, eliminating the need for explicit exposure to harmful data. By dynamically updating unsafe knowledge during test-time, our framework improves generalization to unseen jailbreak strategies while maintaining efficiency. Experiments on multiple VLM jailbreak benchmarks demonstrate that JAILDAM delivers state-of-the-art performance in harmful content detection, improving both accuracy and speed.