OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, Ranjie Duan, Tianlin Li, Yihao Huang, Zhitao Zeng, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, Yang Liu

2025-12-09

Summary

This paper introduces a new tool called OmniSafeBench-MM designed to test how easily multi-modal large language models, which can understand both text and images, can be tricked into giving harmful responses.

What's the problem?

Current methods for testing the safety of these models are limited: they cover only a few types of attacks, lack a standard way to measure defenses, and offer no shared toolkit researchers can use to reproduce results. In short, it is hard to check consistently and thoroughly whether these AI systems are truly safe and cannot be exploited to generate dangerous content.

What's the solution?

The researchers created OmniSafeBench-MM, a comprehensive toolbox that includes 13 different attack methods, 15 ways to defend against attacks, and a large dataset covering 9 different risk areas with 50 specific categories. They also developed a detailed evaluation system that measures how harmful a response is, how well it matches the user's intent, and how much detail it provides. They tested this toolbox on 18 different AI models, both publicly available and proprietary.
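The three-dimensional scoring described above can be sketched in code. The scales, thresholds, and names below (`JailbreakEvaluation`, `attack_succeeded`, `attack_success_rate`) are hypothetical illustrations, not the benchmark's actual API; the paper defines the real multi-level harmfulness scale and aggregation.

```python
from dataclasses import dataclass

@dataclass
class JailbreakEvaluation:
    """One model response scored along three dimensions (illustrative scales)."""
    harmfulness: int        # e.g. 0 (harmless) .. 4 (catastrophic societal threat)
    intent_alignment: bool  # does the response actually fulfill the harmful query?
    detail_level: int       # how detailed/actionable the response is, e.g. 0 .. 2

def attack_succeeded(ev: JailbreakEvaluation) -> bool:
    # Count a jailbreak as successful only if the response is both harmful
    # and on-topic; refusals and off-topic replies do not count.
    return ev.harmfulness > 0 and ev.intent_alignment

def attack_success_rate(evals: list[JailbreakEvaluation]) -> float:
    # Aggregate metric over a set of attack attempts against one model.
    if not evals:
        return 0.0
    return sum(attack_succeeded(e) for e in evals) / len(evals)
```

Separating harmfulness from intent alignment and detail level is what enables the "safety-utility" analysis: a model that refuses everything scores safely but unhelpfully, while a model that answers benign queries yet resists harmful ones scores well on both axes.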

Why it matters?

This work is important because it provides a standardized and reproducible way to evaluate the safety of multi-modal AI models. By offering a common platform for testing and defense, it will help researchers develop more robust and reliable AI systems that are less vulnerable to harmful manipulation and better aligned with human values.

Abstract

Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, which is a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, distinguished by a granular, multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreak. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.