Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models
Yijun Yang, Lichao Wang, Jianping Zhang, Chi Harold Liu, Lanqing Hong, Qiang Xu
2025-11-24
Summary
This paper investigates how easily Vision-Language Models (VLMs) – AI systems that understand both images and text – can be tricked into doing things they shouldn't, even with safety measures in place.
What's the problem?
VLMs are becoming more powerful, but also more prone to misuse. Providers try to protect them with layers of defenses – safety training (alignment tuning), special system instructions, and content filters – but it's unclear how well these defenses hold up against clever attacks. Essentially, just because a VLM *seems* safe doesn't mean it *is* safe when someone actively tries to break it.
What's the solution?
The researchers created a new attack method called Multi-Faceted Attack (MFA). It hides harmful requests inside seemingly harmless tasks with multiple competing goals – like asking the VLM to describe a picture *and* follow a secret, dangerous instruction at the same time. A key component, called the Attention-Transfer Attack, exploits how these models juggle competing objectives, steering their attention away from the harmful part of the request. The researchers also made the attack work across different VLMs, even ones built differently, by targeting the visual representations those models share and by adding a simple repetition trick that slips past content filters. They tested this on models like GPT-4o, Gemini, and Llama-4.
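The cross-model transfer idea – optimizing an image against one vision encoder so that its embedding lands where the attacker wants, then reusing that image against other models – can be sketched in miniature. This is an illustrative toy, not the paper's released code: the "encoder" below is a fixed linear map standing in for a real vision encoder, and all names (`encode`, `pgd`, `eps`) are assumptions. It shows the standard PGD-style recipe: take signed gradient steps on the input, then project back into a small perturbation budget around the clean image.

```python
import random

# Toy stand-in for a shared vision encoder: a fixed random linear map.
# (Real attacks would backpropagate through an actual encoder like CLIP's.)
DIM_IN, DIM_OUT = 8, 4
random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(DIM_IN)] for _ in range(DIM_OUT)]

def encode(x):
    """Map an 'image' (flat vector) to an embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss(x, target):
    """Squared distance between the encoding of x and a target embedding."""
    return sum((e - t) ** 2 for e, t in zip(encode(x), target))

def grad(x, target):
    """Analytic gradient of the loss w.r.t. the input: 2 * W^T (Wx - t)."""
    r = [e - t for e, t in zip(encode(x), target)]
    return [2 * sum(W[j][i] * r[j] for j in range(DIM_OUT))
            for i in range(DIM_IN)]

def pgd(x0, target, eps=0.3, alpha=0.05, steps=100):
    """Signed-gradient steps, projected onto an L_inf ball of radius eps."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x, target)
        # Step against the gradient sign, then clip back into [x0-eps, x0+eps].
        x = [min(x0[i] + eps,
                 max(x0[i] - eps,
                     x[i] - alpha * (1 if g[i] > 0 else -1)))
             for i in range(DIM_IN)]
    return x

clean = [0.5] * DIM_IN
# Embedding the attacker wants the image to imitate (here: of another input).
target = encode([0.9, -0.4, 0.1, 0.7, -0.8, 0.3, 0.0, 0.6])
adv = pgd(clean, target)
print(loss(clean, target), "->", loss(adv, target))
```

The paper's observation is that because many VLMs build on similar visual representations, an image optimized this way against one surrogate encoder often fools unseen models too – no model-specific fine-tuning needed.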
Why does it matter?
This research shows that current safety measures for VLMs are not as strong as people might think. The fact that attacks can easily transfer between different models suggests a fundamental weakness in how these systems process visual information. This is important because it highlights the need for better, more robust safety mechanisms to prevent VLMs from being used for harmful purposes.
Abstract
The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack