Hyperbolic Safety-Aware Vision-Language Models
Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, Rita Cucchiara
2025-03-19
Summary
This paper is about making AI image systems safer by teaching them to recognize the difference between safe and unsafe content, rather than forcing them to forget the unsafe content altogether.
What's the problem?
AI image systems can sometimes retrieve or display unsafe or inappropriate content. Previous attempts to fix this tried to erase the model's knowledge of unsafe concepts, but this unlearning also made it harder for the model to understand the world and to tell safe content apart from unsafe content.
What's the solution?
The researchers created a new system called HySAC that organizes safe and unsafe content in separate regions of a curved (hyperbolic) space. Because the model stays aware of what is unsafe rather than forgetting it, it can recognize unsafe queries and redirect them toward safer alternatives instead of showing unsafe images.
Why does it matter?
Keeping AI image systems aware of unsafe content makes them safer, more reliable, and easier to moderate, which is essential for real-world applications.
Abstract
Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model's knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling, ineffective in standard vision-language models due to their reliance on Euclidean embeddings, endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models. Our source code is available at https://github.com/aimagelab/HySAC.
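To make the abstract's key mechanism concrete, below is a minimal sketch of a Lorentz-model entailment loss in the style of hyperbolic CLIP variants such as MERU, on which this line of work builds, plus an illustrative root-ward redirection for unsafe queries. All names here (`lift_to_hyperboloid`, `entailment_loss`, `redirect_toward_root`, and the constants `curv` and `k`) are assumptions for illustration, not the paper's actual API; see the linked repository for the real implementation.

```python
import torch

# Hedged sketch: Lorentz-model entailment cones, as used by MERU-style
# hyperbolic vision-language models. Not HySAC's actual code.

def lift_to_hyperboloid(x_space: torch.Tensor, curv: float) -> torch.Tensor:
    """Append the time coordinate so points lie on the hyperboloid of curvature -curv."""
    x_time = torch.sqrt(1.0 / curv + x_space.pow(2).sum(-1, keepdim=True))
    return torch.cat([x_time, x_space], dim=-1)

def lorentz_inner(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Lorentzian inner product <u, v>_L = -u_t * v_t + <u_space, v_space>."""
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

def half_aperture(x: torch.Tensor, curv: float, k: float = 0.1) -> torch.Tensor:
    """Half-aperture of the entailment cone at x; cones narrow away from the root."""
    space_norm = x[..., 1:].norm(dim=-1).clamp_min(1e-8)
    return torch.asin((2.0 * k / (curv ** 0.5 * space_norm)).clamp(max=1.0 - 1e-6))

def exterior_angle(x: torch.Tensor, y: torch.Tensor, curv: float) -> torch.Tensor:
    """Angle at x between the geodesic to y and the geodesic to the root."""
    inner = curv * lorentz_inner(x, y)  # <= -1 for distinct on-manifold points
    num = y[..., 0] + x[..., 0] * inner
    denom = x[..., 1:].norm(dim=-1) * (inner.pow(2) - 1.0).clamp_min(1e-8).sqrt()
    return torch.acos((num / denom).clamp(-1.0 + 1e-6, 1.0 - 1e-6))

def entailment_loss(general: torch.Tensor, specific: torch.Tensor, curv: float) -> torch.Tensor:
    """Penalize `specific` lying outside the entailment cone of `general`.
    In a safety-aware hierarchy, safe embeddings would play the `general` role
    so that their unsafe counterparts are pushed deeper into the hierarchy."""
    violation = exterior_angle(general, specific, curv) - half_aperture(general, curv)
    return violation.clamp_min(0.0).mean()

def redirect_toward_root(x: torch.Tensor, curv: float, t: float = 0.5) -> torch.Tensor:
    """Illustrative safe redirection: shrink the space components and re-lift,
    yielding a point on the geodesic from x toward the root of the hierarchy."""
    return lift_to_hyperboloid(x[..., 1:] * (1.0 - t), curv)

# Toy usage: lift random embeddings and score a safe -> unsafe pair.
safe = lift_to_hyperboloid(torch.randn(4, 512) * 0.1, curv=1.0)
unsafe = lift_to_hyperboloid(torch.randn(4, 512) * 0.1, curv=1.0)
print(entailment_loss(safe, unsafe, curv=1.0))
```

At inference time, the abstract's "dynamic redirection" corresponds to applying a root-ward traversal like `redirect_toward_root` to an unsafe query so retrieval lands in the safe region, or skipping it to retain the original output; the exact formulation is given in the paper and the repository above.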