Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, Yang Liu
2025-05-28
Summary
This paper introduces FOA-Attack, a transfer-based adversarial attack designed to fool multimodal large language models (MLLMs), which process both text and images, even when the target model is closed-source and its inner workings are unknown.
What's the problem?
Multimodal large language models, which can understand both pictures and words, are often assumed to be harder to attack when their inner workings are hidden from the public. However, attackers can craft special inputs, called adversarial examples, on open surrogate models and transfer them to closed-source targets, fooling those models into making mistakes or giving wrong answers.
What's the solution?
To address this, the researchers developed FOA-Attack, which crafts adversarial images by aligning their features with those of a target image at two levels: overall (global) features are matched using cosine similarity, while detailed (local) features are matched using optimal transport. Optimizing both levels makes the attack more likely to transfer across different models, even if the attacker doesn't know exactly how the target model was built.
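The two-level alignment described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names (`foa_loss`, `sinkhorn`), the entropic-regularized Sinkhorn solver, the uniform marginals, and the loss weighting are all assumptions for demonstration; global features stand in for pooled CLIP-style embeddings and local features for patch-level embeddings.

```python
import numpy as np

def cosine_sim(a, b):
    # Global alignment: cosine similarity between pooled feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sinkhorn(cost, eps=0.1, iters=100):
    # Entropic-regularized optimal transport via Sinkhorn iterations.
    # cost: (n, m) pairwise cost matrix between two sets of local features.
    n, m = cost.shape
    K = np.exp(-cost / eps)          # Gibbs kernel
    r = np.ones(n) / n               # uniform source marginal (assumption)
    c = np.ones(m) / m               # uniform target marginal (assumption)
    v = np.ones(m) / m
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    P = np.diag(u) @ K @ np.diag(v)  # transport plan
    return float((P * cost).sum())   # OT alignment cost

def foa_loss(adv_global, tgt_global, adv_local, tgt_local):
    # Combined objective (illustrative): reward global cosine similarity,
    # penalize the OT cost between local (patch-level) features.
    an = adv_local / np.linalg.norm(adv_local, axis=1, keepdims=True)
    tn = tgt_local / np.linalg.norm(tgt_local, axis=1, keepdims=True)
    cost = 1.0 - an @ tn.T           # cosine distance between patches
    return -cosine_sim(adv_global, tgt_global) + sinkhorn(cost)
```

In an actual attack, this loss would be minimized over the adversarial image's pixels (under a perturbation budget) with gradients from one or more surrogate encoders; the snippet only shows the shape of the alignment objective.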
Why it matters?
This is important because it shows that even closed-source AI systems, often assumed safer because their details are hidden, can still be vulnerable to transferable adversarial attacks. Understanding these weaknesses helps researchers and developers build better defenses against AI systems being tricked or misused.
Abstract
A method named FOA-Attack is proposed to enhance adversarial transferability in multimodal large language models by optimizing both global and local feature alignments using cosine similarity and optimal transport.