Adversarial Attacks on Multimodal Agents

Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

2024-06-19

Adversarial Attacks on Multimodal Agents

Summary

This paper discusses the safety risks associated with multimodal agents, which are AI systems that can process and act on both visual and text information. The authors demonstrate how these agents can be manipulated through specific attacks, highlighting potential vulnerabilities.

What's the problem?

As multimodal agents become more common, there are growing concerns about their safety. These agents rely on vision-enabled language models (VLMs) to understand and interact with their environments. However, attacking these agents is more complex than previous methods because attackers often have limited access to the environment and the model's inner workings. This creates a risk that these agents could be misled or manipulated, leading to unintended actions.

What's the solution?

The authors propose two types of attacks to exploit these vulnerabilities: the captioner attack and the CLIP attack. The captioner attack targets models that generate captions from images, allowing attackers to influence the agent's behavior by modifying the captions. The CLIP attack works by targeting a group of models that can be used with VLMs. They tested these attacks in a controlled environment called VisualWebArena-Adv, finding that the captioner attack could successfully manipulate the agent's actions 75% of the time. In contrast, when the captioner was removed or when the agent generated its own captions, the success rates for the CLIP attack were lower at 21% and 43%, respectively.

Why it matters?

This research is important because it highlights significant security issues in the deployment of multimodal agents. Understanding how these agents can be attacked helps developers create better defenses to protect against manipulation in real-world applications. As AI systems become more integrated into everyday life, ensuring their safety and reliability is crucial for user trust and effective operation.

Abstract

Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of 16/256 on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack

View Paper