Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
2026-02-16
Summary
This paper focuses on improving the ability of large AI models that 'see' and 'understand' images to pick up on small but important details within those images, a capability called fine-grained perception.
What's the problem?
Current AI models are good at generally understanding what's in an image, but they often miss crucial details because those details are small and get lost when the model looks at the whole picture at once. Existing methods try to solve this by having the AI repeatedly zoom in and out of different areas, but this process is slow and takes a lot of computing power.
What's the solution?
The researchers developed a new training technique called Region-to-Image Distillation. Instead of making the AI zoom while it is being used, they taught it to focus on important regions during training. They zoomed into key areas of training images and had stronger 'teacher' models write question-answer pairs about those close-ups, then trained the smaller model to answer those same questions while looking at the full image. This lets it pick out details in a single pass, without needing to 'zoom' in during use. They also created a new set of test questions, called ZoomBench, specifically designed to measure this fine-grained perception ability.
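To make the idea concrete, here is a minimal Python sketch of the data-construction step described above. It is only an illustration under assumptions: `region_detector`, `teacher.generate_vqa`, and the PIL-style `image.crop` interface are hypothetical placeholders, not the authors' actual API, and the real pipeline may differ in its details.

```python
from typing import Callable, Iterable

def build_region_to_image_data(images: Iterable, region_detector: Callable, teacher) -> list:
    """Generate region-grounded VQA pairs from close-up crops, then attach them to the full image."""
    training_data = []
    for image in images:
        for box in region_detector(image):       # hypothetical: propose small, detail-rich boxes (left, top, right, bottom)
            crop = image.crop(box)               # "zoom in" only at data-construction time (PIL-style crop)
            for question, answer in teacher.generate_vqa(crop):  # hypothetical stand-in for the strong teacher models
                # The supervision is distilled back to the FULL image: the student must later
                # answer this region-level question from a single glance at the whole scene.
                training_data.append({"image": image, "question": question, "answer": answer})
    return training_data
```

The key design point the sketch highlights is that zooming happens only while building the training data; the student model trained on this data never receives the crops themselves.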
Why it matters?
This work is important because it makes these AI models faster and more efficient at understanding images. By teaching the AI to focus on details during training, they eliminate the need for slow, repeated zooming. This improvement isn't just limited to detail-oriented tasks; it also boosts the AI’s overall ability to reason about images and interact with visual interfaces, like those found in apps and websites.
Abstract
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA instances spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks for visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
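The abstract's dual-view protocol can be read as comparing accuracy on the same questions under two views of each image. The sketch below shows one way such a "zooming gap" could be computed; `model.answer` and the item fields (`full_image`, `region_crop`, `question`, `answer`) are hypothetical names for illustration, not ZoomBench's actual format.

```python
def zooming_gap(model, benchmark) -> float:
    """Dual-view evaluation: answer each question from the full image and from the gold region crop."""
    n = global_correct = regional_correct = 0
    for item in benchmark:                      # hypothetical item layout: full image, region crop, question, answer
        n += 1
        if model.answer(item["full_image"], item["question"]) == item["answer"]:
            global_correct += 1                 # "single-glance" global view
        if model.answer(item["region_crop"], item["question"]) == item["answer"]:
            regional_correct += 1               # zoomed-in regional view
    # A large positive gap means the model only perceives the detail when handed the crop.
    return regional_correct / n - global_correct / n
```

Under this reading, Region-to-Image Distillation aims to shrink the gap from the global side, raising single-glance accuracy toward what the model achieves when the region is cropped out for it.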