ProGuard: Towards Proactive Multimodal Safeguard
Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao
2025-12-30
Summary
This paper introduces ProGuard, a new system designed to identify and explain potentially harmful content in text, images, and combined text-and-image inputs handled by AI models.
What's the problem?
Current methods for making AI safer often react *after* a harmful output is created and require changes to the AI model itself. They struggle to keep up with the rapidly evolving risks that arise when AI can generate both images and text, and they can be biased toward one modality (say, text over images). Basically, existing safety measures aren't proactive enough and don't handle combined image-and-text content well.
What's the solution?
The researchers created ProGuard by first building a large, carefully balanced dataset of 87,000 examples of safe and unsafe content, covering text, images, and combinations of the two. This balance helps avoid modality bias. They then trained an AI model using reinforcement learning, essentially rewarding it for correctly identifying and *describing* unsafe content, even content it hasn't seen before. A key part of the training encourages the model to give short, clear explanations of why something is unsafe, using a 'synonym bank' of accepted phrasings to score those explanations.
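One way to picture the synonym-bank idea: the model's short risk description is compared against a bank of accepted phrasings for the true category, and the best match becomes the reward. The sketch below is a simplified illustration under assumed details; the bank contents, tokenization, and word-overlap (Jaccard) scoring are our own assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of a synonym-bank-based similarity reward.
# Tokenization and Jaccard scoring are illustrative assumptions.

def tokens(text: str) -> set[str]:
    """Lowercase word set for a rough lexical comparison."""
    return set(text.lower().split())

def similarity_reward(prediction: str, synonym_bank: list[str]) -> float:
    """Reward = best word-overlap (Jaccard) between the model's short risk
    description and any accepted synonym for the ground-truth category."""
    pred = tokens(prediction)
    best = 0.0
    for synonym in synonym_bank:
        syn = tokens(synonym)
        if pred or syn:
            best = max(best, len(pred & syn) / len(pred | syn))
    return best

# Example: an unseen risk category with a few accepted phrasings
# (the category and phrasings here are made up for illustration).
bank = ["weapon manufacturing instructions",
        "guides for making weapons",
        "instructions for building weapons"]
print(similarity_reward("instructions for weapon manufacturing", bank))  # → 0.75
```

A descriptive but differently worded prediction still earns partial credit, which is the point: the reward encourages concise descriptions that land near any accepted phrasing rather than an exact string match.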
Why does it matter?
ProGuard is important because it can proactively detect and explain new types of safety risks in AI-generated content, without needing to constantly retrain the underlying AI model. It significantly improves the ability to identify unseen risks (a 52.6% improvement) and describe those risks clearly (a 64.8% improvement), performing as well as or better than existing safety systems.
Abstract
The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
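The abstract describes an RL objective that combines correctness on the safety label and risk category with a description-similarity term. A minimal sketch of how such a composite reward might be assembled, assuming a simple additive form with an illustrative weight (the component structure and `alpha` are our assumptions, not the paper's stated objective):

```python
# Hypothetical composition of the RL reward: label and category
# correctness plus a weighted description-similarity term.

def combined_reward(pred_label: str, gold_label: str,
                    pred_category: str, gold_category: str,
                    description_similarity: float,
                    alpha: float = 0.5) -> float:
    """Return a scalar reward for one moderation rollout.

    description_similarity is assumed to come from a synonym-bank
    similarity score in [0, 1]; alpha weights that term.
    """
    label_r = 1.0 if pred_label == gold_label else 0.0
    category_r = 1.0 if pred_category == gold_category else 0.0
    return label_r + category_r + alpha * description_similarity

# A fully correct rollout with a strong description scores highest;
# a wrong label and category with no overlap scores zero.
r = combined_reward("unsafe", "unsafe", "violence", "violence", 0.8)
```

Weighting the similarity term separately lets training trade off classification accuracy against description quality; the exact balance used by ProGuard is not specified here.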