Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

2025-09-09

Summary

This paper investigates why Vision-Language Models (VLMs), which are good at understanding images and text together, struggle with complicated images and proposes a way to improve their performance without needing to retrain them.

What's the problem?

VLMs often see a drop in accuracy when dealing with visually complex scenes. Current methods to fix this either require a lot of extra training, depend on separate segmentation tools to identify important parts of the image, or operate at too coarse a level to capture fine details. The core issue is that complex images scatter the model's attention, making it harder to reason about what's happening in the picture.

What's the solution?

The researchers discovered that the way VLMs pay attention to images changes with visual complexity: in simple images, attention quickly converges on key areas, while in complex images it stays scattered. They found that by contrasting how the model attends to the image under a general query versus a task-specific query, they could separate the useful semantic signal from the distracting visual 'noise'. Building on this, they developed Contrastive Attention Refinement for Visual Enhancement (CARVE), a technique that uses this contrast to refine the model's attention, highlighting the important parts of the image at the pixel level, all without any additional training.
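To make the contrasting idea concrete, here is a minimal NumPy sketch. It assumes you already have two 2D attention maps over image locations (one from a general query, one from the task-specific query) and builds a pixel-level mask from their difference. The function names, the subtraction-based contrast, and the masking step are illustrative assumptions, not the paper's exact CARVE formulation.

```python
import numpy as np

def contrastive_attention_mask(general_attn, task_attn, eps=1e-8):
    """Contrast a task-specific attention map against a general one.

    Both inputs are (H, W) arrays of non-negative attention weights.
    Locations the task query emphasizes *more* than the general query
    are kept as semantic signal; the rest is treated as visual noise.
    Illustrative sketch only; the paper's formulation may differ.
    """
    # Normalize each map to a probability distribution over locations.
    g = general_attn / (general_attn.sum() + eps)
    t = task_attn / (task_attn.sum() + eps)
    # Elementwise contrast: positive where the task query attends more.
    contrast = np.clip(t - g, 0.0, None)
    # Rescale to [0, 1] so it can serve as a pixel-level mask.
    if contrast.max() > 0:
        contrast = contrast / contrast.max()
    return contrast

def reweight_image(image, mask, floor=0.2):
    """Hypothetical use: dim non-salient pixels instead of erasing them."""
    weights = floor + (1.0 - floor) * mask      # (H, W) in [floor, 1]
    return image * weights[..., None]           # broadcast over channels
```

In this sketch, regions where the task query attends no more than the general query get a contrast of zero, so only task-relevant regions survive the mask.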

Why it matters?

This work is important because it provides a new understanding of how VLMs process images and identifies a simple, effective way to boost their performance on difficult visual tasks. CARVE offers a practical solution for improving visual reasoning in VLMs without the cost and complexity of retraining, potentially making these models more reliable in real-world applications.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; (3) theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
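Finding (1) in the abstract uses attention entropy as a measure of how scattered attention is. A small sketch of that measurement, assuming the attention map is a 2D array of non-negative weights (the paper's exact computation may differ):

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy of an attention map.

    Higher entropy means attention is spread over many locations
    (dispersed, as in complex scenes); lower entropy means it has
    converged on a few key regions. Illustrative proxy only.
    """
    p = attn.flatten()
    p = p / (p.sum() + eps)                 # normalize to a distribution
    return float(-(p * np.log(p + eps)).sum())
```

A uniform map over N locations gives the maximum entropy log(N), while a map concentrated on a single location gives entropy near zero, matching the intuition that focused attention corresponds to low entropy.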