
Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang

2026-01-09


Summary

This research examines how easily vision-language models, which are good at understanding both images and text, can be tricked into producing harmful outputs through carefully crafted changes to their inputs. It finds that these models are less safe than we might think, and that a surprisingly small number of decision points during text generation are responsible for most of the damage.

What's the problem?

Vision-language models are vulnerable to 'adversarial attacks' – small changes to the input that cause the model to make mistakes. Existing attacks try to maximize the model's uncertainty across *all* parts of the text it generates, implicitly assuming every word matters equally. This is inefficient: only a small portion of the tokens the model chooses actually determines the final output, and those few positions are the ones most vulnerable to attack.
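
To make the uncertainty idea concrete, here is a minimal sketch of measuring per-token entropy during decoding. It assumes a PyTorch setting and a tensor of per-step logits with an illustrative shape; it mirrors the concept in the summary, not the authors' exact code.

```python
# Hypothetical sketch: measure how uncertain the model is at each decoding step.
# Assumes `logits` holds the next-token logits for every generated position,
# shape (seq_len, vocab_size).
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each step."""
    log_probs = F.log_softmax(logits, dim=-1)           # (seq_len, vocab_size)
    return -(log_probs.exp() * log_probs).sum(dim=-1)   # (seq_len,)

# Toy example: 6 decoding steps over a 10-token vocabulary.
logits = torch.randn(6, 10)
print(token_entropies(logits))  # larger values = more uncertain decision points
```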

What's the solution?

The researchers discovered that focusing attacks on just the 20% of tokens the model is most uncertain about – the 'critical decision points' – degrades outputs about as much as attacking every position, while using a far smaller perturbation budget. They developed a new attack method, Entropy-bank Guided Adversarial attacks (EGA), that specifically targets these high-entropy points. EGA achieves 93-95% attack success rates, converts 35-49% of benign outputs into harmful ones, and, because the same vulnerable positions recur across architecturally different vision-language models, it also transfers to unseen targets (17-26% harmful rates).
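
The snippet below illustrates the general recipe as an assumption, not the authors' released implementation: restrict an entropy-maximizing attack objective to the top ~20% highest-entropy positions, so that gradient updates to the adversarial image perturbation spend their budget only on those critical steps. The specific EGA "entropy bank" and loss terms are not detailed in this summary.

```python
# Illustrative only: concentrate an entropy-maximizing loss on the ~20% of
# decoding steps with the highest uncertainty. Entropy is computed per step
# as in the earlier sketch; the 0.2 ratio comes from the summary, everything
# else is an assumption.
import torch

def critical_steps(entropies: torch.Tensor, ratio: float = 0.2) -> torch.Tensor:
    """Indices of the highest-entropy decoding steps ('critical decision points')."""
    k = max(1, int(ratio * entropies.numel()))
    return torch.topk(entropies, k).indices

def selective_entropy_loss(logits: torch.Tensor, ratio: float = 0.2) -> torch.Tensor:
    """Sum of per-step entropies over the selected positions only.
    Maximizing this w.r.t. an image perturbation (e.g. with PGD) focuses the
    attack on the few steps that steer the output trajectory."""
    log_probs = torch.log_softmax(logits, dim=-1)
    ent = -(log_probs.exp() * log_probs).sum(dim=-1)   # per-step entropy
    idx = critical_steps(ent.detach(), ratio)          # freeze the position choice
    return ent[idx].sum()
```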

Why it matters?

This work highlights a serious safety flaw in current vision-language models. It shows that these models are surprisingly easy to manipulate into generating harmful content, and that the vulnerable decision points recur across different model designs, so an attack crafted on one model can carry over to another. Even advanced models aren't necessarily safe, and better defenses against these kinds of attacks are needed to prevent misuse.

Abstract

Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with the reliability of VLMs. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, enabling feasible transferability (17-26% harmful rates on unseen targets). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieves competitive attack success rates (93-95%) alongside high harmful conversion, thereby revealing new weaknesses in current VLM safety mechanisms.