AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
2025-12-04
Summary
This paper introduces a new way to make Vision-Language Models (models that answer questions about images) more efficient: it reduces the amount of visual information the model must process without sacrificing accuracy.
What's the problem?
Current Vision-Language Models need to look at a lot of detail in an image – broken down into 'visual tokens' – which takes a lot of computing power. Existing methods try to simplify this by reducing the number of tokens, but they do so at a fixed compression ratio and can't adjust based on how difficult the question or image is. Essentially, they don't intelligently decide *what* parts of the image are most important to look at.
What's the solution?
The researchers developed a model called AdaptVision that mimics human active vision. It starts by looking at a low-resolution, compressed version of the image. If it needs more detail to answer the question, it actively 'zooms in' on specific areas by invoking a bounding-box tool that crops important regions. The model is trained with reinforcement learning to balance getting the right answer against using as few visual tokens as possible. A key part of the training, called Decoupled Turn Policy Optimization (DTPO), separates how the model learns to use the 'zoom' tool from how it learns to improve its answers, making the learning more effective.
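The coarse-to-fine loop described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the model interface (`encode`/`generate`, the `"action"` and `"bbox"` fields) and the `MockVLM` stand-in are assumptions made for the example.

```python
# Sketch (assumed interface, not the authors' code): answer from cheap
# low-resolution visual tokens first; invoke a bounding-box crop tool
# only when the model asks for more detail.

def adaptive_answer(model, image, question, max_turns=3):
    """Coarse-to-fine VQA: start with compressed tokens, zoom in on demand."""
    tokens = model.encode(image, resolution="low")   # compressed first pass
    for _ in range(max_turns):
        out = model.generate(tokens, question)
        if out["action"] == "answer":                # confident: stop early
            return out["text"], len(tokens)
        x0, y0, x1, y1 = out["bbox"]                 # model requested a crop
        crop = [row[x0:x1] for row in image[y0:y1]]
        # Re-encode the key region at high resolution and append its tokens.
        tokens = tokens + model.encode(crop, resolution="high")
    # Out of turns: force a final answer with whatever tokens we have.
    return model.generate(tokens, question)["text"], len(tokens)

class MockVLM:
    """Toy stand-in that needs exactly one zoom before it can answer."""
    def __init__(self):
        self.zoomed = False
    def encode(self, image, resolution):
        # The low-resolution pass yields fewer visual tokens than a crop.
        return [resolution] * (4 if resolution == "low" else 8)
    def generate(self, tokens, question):
        if not self.zoomed:
            self.zoomed = True
            return {"action": "crop", "bbox": (0, 0, 2, 2)}
        return {"action": "answer", "text": "a red bus"}
```

With the mock, an easy sample would stop after the first pass (4 tokens), while this one zooms once and ends up with 12 — the per-sample token budget the paper is after.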
Why it matters?
This research is important because it makes these powerful image-understanding models more practical. By reducing the computational cost, they can be used on devices with less processing power and can answer questions faster. It also moves closer to how humans actually process images, focusing attention on the most relevant parts instead of trying to analyze everything at once.
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
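The decoupled advantage estimation in DTPO can be illustrated with a small, GRPO-style sketch: each rollout's tool-use reward and answer-accuracy reward are normalized within the group separately, yielding two advantages that would be applied to the corresponding token spans. The reward field names and the exact normalization are assumptions for illustration, not the paper's formulas.

```python
import statistics

def decoupled_advantages(rollouts):
    """Group-normalize tool-use and answer-accuracy rewards separately,
    in the spirit of DTPO's decoupled objectives (field names illustrative)."""
    def normalize(xs):
        mu = statistics.mean(xs)
        sd = statistics.pstdev(xs) or 1.0   # guard against zero variance
        return [(x - mu) / sd for x in xs]

    tool_adv = normalize([r["tool_reward"] for r in rollouts])
    acc_adv = normalize([r["accuracy_reward"] for r in rollouts])
    # One advantage for the rollout's tool-call tokens, one for its
    # answer tokens, instead of a single shared scalar as in vanilla GRPO.
    return [{"tool_adv": t, "acc_adv": a} for t, a in zip(tool_adv, acc_adv)]
```

A rollout that used the tool correctly but answered wrong would get a positive `tool_adv` and a negative `acc_adv`, so the two behaviors are reinforced independently rather than blurred into one reward.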