Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
2025-12-11
Summary
This paper focuses on making Vision-Language Models (VLMs) more reliable when faced with tricky, intentionally misleading inputs, also known as adversarial attacks. It specifically looks at how common 'function words' like 'a', 'the', and 'in' can make these models vulnerable.
What's the problem?
VLMs, which combine image and text understanding, can be easily fooled by small changes to the text, especially when those changes involve function words. These models often achieve strong overall performance but lack 'robustness': they aren't consistent when presented with slightly altered inputs. The issue is that these function words can be exploited during attacks, causing the model to make incorrect predictions.
What's the solution?
The researchers developed a technique called Function-word De-Attention (FDA). Think of it like a noise filter. FDA works within the model's attention mechanism, which is how it focuses on important parts of the image and text. It calculates how much the model is paying attention to function words and then *subtracts* that attention from the overall focus. This makes the model less sensitive to those vulnerable words, leading to more stable and accurate results. It's similar to how a differential amplifier works in electronics, focusing on the difference between signals.
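The subtraction idea above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the paper's exact formulation: the function-word list, the `fda_attention` helper, and the scaling factor `alpha` are all assumptions made for the example.

```python
import numpy as np

# Illustrative set of function words; the paper's actual list may differ.
FUNCTION_WORDS = {"a", "an", "the", "in", "of", "on"}

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fda_attention(q, k, tokens, alpha=1.0):
    """Sketch of function-word de-attention inside one attention head.

    q: (n_img, d) image-side queries; k: (n_txt, d) text-side keys;
    tokens: the n_txt text tokens; alpha: assumed subtraction strength.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # scaled dot-product scores
    attn = softmax(scores, axis=-1)     # original cross-attention map

    # Attention mass that falls on function-word positions.
    mask = np.array([t.lower() in FUNCTION_WORDS for t in tokens])
    fw_attn = attn * mask               # keep only function-word columns

    # Differential subtraction, then renormalize each row.
    de_attn = np.clip(attn - alpha * fw_attn, 0.0, None)
    return de_attn / de_attn.sum(axis=-1, keepdims=True)
```

With `alpha=1.0` the function-word columns are removed entirely, so the model redistributes its attention onto content words; smaller values of `alpha` would only dampen them.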
Why it matters?
This work is important because it improves the reliability of VLMs. The researchers showed significant improvements in resisting attacks across different models, datasets, and tasks, with only a small trade-off in normal performance. This means VLMs can be used more confidently in real-world applications where security and consistent performance are crucial, like image search or robotic navigation. The fact that it works well even on tasks it hasn't been specifically trained for (zero-shot performance) is also a big plus.
Abstract
To address the trade-off between robustness and performance for robust VLMs, we observe that function words can incur vulnerability of VLMs to cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) to mitigate their impact. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields average ASR drops of 18/13/53% with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, along with in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.