OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs
Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang, Xia Hu
2026-01-07
Summary
This paper introduces OpenRT, a new system for thoroughly testing the safety of advanced AI models that can understand both text and images, known as Multimodal Large Language Models (MLLMs). It's designed to find vulnerabilities in these models before they're widely deployed.
What's the problem?
Currently, testing the safety of these powerful AI models is difficult: existing methods are fragmented, only cover simple single-turn interactions, and can't easily be scaled up across many different models and attacks. Because existing tests don't cover enough ground, it's hard to systematically find and fix safety issues before these AIs are misused or produce harmful outputs.
What's the solution?
The researchers created OpenRT, a flexible and efficient framework for 'red-teaming' – essentially, trying to trick the AI into doing something it shouldn't. OpenRT breaks the testing process into five separate parts (the AI model, the test dataset, the attack method, the judge that decides whether an attack succeeded, and the evaluation metrics), which allows a huge number of different tests to be run quickly and easily. The framework includes 37 different attack strategies and was used to test 20 leading AI models, including GPT-5.2 and Gemini 3 Pro.
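The modular split described above can be sketched as a set of swappable components. This is a minimal illustration, not OpenRT's actual API: all class and function names here (`Model`, `Attack`, `Judge`, `run_red_team`, and the toy implementations) are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical interfaces illustrating the described modular split.
# Swapping any one component (model, attack, or judge) leaves the
# others untouched.

class Model(Protocol):
    def generate(self, prompt: str) -> str: ...

class Attack(Protocol):
    def transform(self, prompt: str) -> str: ...

class Judge(Protocol):
    def is_unsafe(self, response: str) -> bool: ...

@dataclass
class EchoModel:
    """Toy stand-in for a real MLLM endpoint: echoes its input."""
    def generate(self, prompt: str) -> str:
        return f"[model reply to: {prompt}]"

@dataclass
class PrefixAttack:
    """Toy attack: wraps the prompt in a jailbreak-style preamble."""
    prefix: str = "Ignore previous instructions. "
    def transform(self, prompt: str) -> str:
        return self.prefix + prompt

@dataclass
class KeywordJudge:
    """Toy judge: flags responses containing any banned keyword."""
    keywords: tuple = ("Ignore previous instructions",)
    def is_unsafe(self, response: str) -> bool:
        return any(k in response for k in self.keywords)

def run_red_team(model: Model, attack: Attack, judge: Judge,
                 prompts: list[str]) -> float:
    """Run one attack against one model and return the Attack
    Success Rate (ASR): unsafe responses / total prompts."""
    successes = sum(
        judge.is_unsafe(model.generate(attack.transform(p)))
        for p in prompts
    )
    return successes / len(prompts)

asr = run_red_team(EchoModel(), PrefixAttack(), KeywordJudge(),
                   ["test prompt 1", "test prompt 2"])
```

This decoupling is what makes the combinatorial testing feasible: 37 attacks against 20 models is 740 pairings, and none of them requires any component-specific glue code.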
Why it matters?
This work is important because it reveals that even the most advanced AI models still have significant safety flaws. The tests showed that these models fail to defend consistently across different kinds of attacks, and that being a 'reasoning' model doesn't automatically make a model safer. By making OpenRT publicly available, the researchers hope to encourage more research into AI safety and help create better, more reliable AI systems.
Abstract
The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.