
Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Jakub Hoscilowicz, Artur Janicki

2025-12-04


Summary

This research introduces the Adversarial Confusion Attack, a new way to disrupt AI systems that use both text and images. It's not about making the AI say bad things or misidentify one specific object, but about making it generally confused so it produces incoherent or confidently wrong answers.

What's the problem?

Current AI models that handle images and text together, known as multimodal large language models (MLLMs), are vulnerable to being tricked. Existing attacks usually aim to make the AI do something specific it shouldn't, such as revealing secret information or misclassifying an image. This paper studies a different threat: reliably disrupting the model's ability to function at all, making it unreliable for real-world tasks.

What's the solution?

The researchers developed a method to create slightly altered images – changes you likely wouldn't even notice – that cause these AI models to become confused. They do this by finding changes to an image that maximize the uncertainty (entropy) of the AI's next-token prediction. They crafted the images against a small group of freely available open-source models and found that a single altered image could disrupt many different models, even proprietary ones like GPT-5.1 that they never optimized against. They used a standard technique for creating these alterations, projected gradient descent (PGD), showing that no complex machinery is needed for the attack to be effective.
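To make the idea concrete, here is a minimal sketch of the optimization loop described above. It does not use real MLLMs: each "model" is a hypothetical toy stand-in (a random linear layer mapping a flattened image to vocabulary logits), so only the objective – PGD ascent on the ensemble-averaged next-token entropy, with the perturbation kept inside a small L-infinity ball – matches the paper's description. All names, sizes, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for an ensemble of MLLM next-token heads: each "model"
# maps a flattened image to vocabulary logits via a fixed random linear
# layer. Real MLLMs are deep networks; this only illustrates the objective.
DIM, VOCAB = 64, 10
ensemble = [rng.normal(size=(VOCAB, DIM)) for _ in range(3)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def entropy_grad_wrt_image(W, x):
    # Analytic gradient of H(softmax(Wx)) w.r.t. x:
    # dH/dz_j = -p_j (log p_j + H), then chain through the linear map W.
    p = softmax(W @ x)
    H = entropy(p)
    dH_dz = -p * (np.log(p + 1e-12) + H)
    return W.T @ dH_dz

def confusion_attack(x0, eps=0.05, alpha=0.01, steps=40):
    """PGD that *maximizes* the ensemble-averaged next-token entropy,
    keeping the perturbation within an L-infinity ball of radius eps."""
    x = x0.copy()
    for _ in range(steps):
        g = np.mean([entropy_grad_wrt_image(W, x) for W in ensemble], axis=0)
        x = x + alpha * np.sign(g)          # gradient *ascent* step
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the eps-ball
        x = np.clip(x, 0.0, 1.0)            # keep a valid pixel range
    return x

def avg_entropy(x):
    return np.mean([entropy(softmax(W @ x)) for W in ensemble])

x0 = rng.uniform(size=DIM)      # stand-in for a clean image
x_adv = confusion_attack(x0)
print(avg_entropy(x0), avg_entropy(x_adv))  # entropy rises after the attack
```

Note the sign of the update: unlike a targeted attack that descends a loss toward a chosen output, this ascends the entropy, pushing all models in the ensemble toward maximally uncertain next-token distributions.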

Why it matters?

This is important because as AI systems become more integrated into our lives, such as powering agents that browse websites, we need to be sure they are reliable. This attack demonstrates a way to subtly sabotage these systems. Imagine a website embedding these altered images to prevent AI assistants from working correctly – that's the real-world concern this research highlights, and it shows the need for better defenses against this class of attack.

Abstract

We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Practical applications include embedding such adversarial images into websites to prevent MLLM-powered AI Agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and Adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.