Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz

2026-03-24

Summary

This paper introduces AwaRes, a new way for vision-language models – which are AI systems that understand both images and text – to work more efficiently without losing accuracy.

What's the problem?

Typically, these models need to look at images in very high detail to answer questions correctly, but processing all that detail takes a lot of computing power and time. If they look at lower-resolution images to save resources, they might miss important clues, like small words in a picture, and give wrong answers. It's a balancing act between speed and getting the right answer.

What's the solution?

AwaRes solves this by first getting a general overview of the image at a lower resolution. Then, when it needs to answer a specific question, it 'calls for' only the high-resolution parts of the image that are relevant to that question. The researchers created a way to automatically figure out *when* to ask for more detail and *where* to look for it: a judge compares answers given with low- and high-resolution versions of the image, and a grounding model pinpoints the important parts of the image. They then trained the system to learn this process, rewarding it for correct answers and penalizing it for using too much high-resolution processing.
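To make the idea concrete, here is a minimal, self-contained sketch of such an on-demand crop loop. Everything in it is illustrative: `ToyVLM`, `answer_on_demand`, and the confidence test are stand-ins invented for this example, not the paper's actual implementation.

```python
# Hypothetical sketch of a "look where it matters" loop: a toy model answers
# from a low-resolution view and, when unsure, requests one high-resolution
# crop at a time. All names here are illustrative, not from the paper.

class ToyVLM:
    """Stand-in for a VLM that can only read fine detail (e.g. small text)
    once the relevant high-resolution crop is in its context."""

    def __init__(self, evidence_region):
        self.evidence_region = evidence_region  # (x0, y0, x1, y1) in pixels

    def answer(self, crops, question):
        # "Confident" only once the evidence region has been retrieved.
        confident = self.evidence_region in crops
        return ("small text" if confident else "unsure"), confident

    def locate(self, crops, question):
        # Oracle-style grounding: point at where the evidence is.
        return self.evidence_region


def answer_on_demand(model, question, max_crops=4):
    """Answer from the global low-res view first; fetch crops on demand."""
    crops = []  # high-resolution regions retrieved so far
    answer, confident = model.answer(crops, question)
    while not confident and len(crops) < max_crops:
        crops.append(model.locate(crops, question))  # fetch one more crop
        answer, confident = model.answer(crops, question)
    return answer, len(crops)
```

The `max_crops` cap mirrors the efficiency side of the trade-off: the model stops retrieving detail once it is confident or has spent its crop budget.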

Why it matters?

This is important because it means we can build vision-language models that are both accurate *and* practical. By only focusing on the necessary details, AwaRes can make these models faster and cheaper to run, opening up possibilities for using them in more places and on more devices, like phones or robots, without sacrificing performance.

Abstract

Vision-language models (VLMs) typically process images at their native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs improve efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
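A composite reward of the kind described in the abstract (answer correctness minus a crop-cost penalty) can be sketched in a few lines. The 0/1 correctness score and the `lambda_cost` weight below are assumptions for illustration; the paper's actual reward shaping may differ.

```python
# Hedged sketch of a composite RL reward: semantic correctness minus an
# explicit per-crop cost penalty. lambda_cost is an illustrative weight,
# not a value from the paper.

def composite_reward(is_correct, num_crops, lambda_cost=0.1):
    correctness = 1.0 if is_correct else 0.0
    return correctness - lambda_cost * num_crops
```

Under this shaping, a correct answer that used no crops scores higher than a correct answer that retrieved several, pushing the policy to request high-resolution detail only when it actually changes the answer.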