Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
2025-10-22
Summary
This paper introduces a new approach, called Grasp Any Region (GAR), to help AI models better understand images by focusing on specific areas within them and how those areas relate to the whole picture.
What's the problem?
Current AI models that understand both images and text, known as Multimodal Large Language Models (MLLMs), are good at getting the general idea of a scene, but they often miss important details and how different objects interact with each other. Existing methods that *do* focus on specific regions of an image treat those regions as if they exist in isolation, ignoring the broader context, which is crucial for true understanding.
What's the solution?
The researchers developed GAR, which uses an RoI-aligned feature replay technique to 'replay' information from the entire image while analyzing a specific region. This allows the AI to understand the region *in context*. GAR also lets the model handle multiple prompts about different regions simultaneously, and then use that information to reason about complex relationships and answer detailed, free-form questions about the image. The researchers also created a new benchmark, GAR-Bench, to specifically test these abilities.
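The paper summarized here does not include implementation details, but the core idea of RoI-aligned feature replay, re-sampling a region's features from the global feature map and feeding them to the model alongside the global tokens so the region is seen in context, can be sketched roughly as follows. All names, shapes, and the token layout below are illustrative assumptions, not GAR's actual code:

```python
import numpy as np

def roi_align(feat, box, out_size=2):
    """Bilinearly sample a fixed-size out_size x out_size grid of features
    from `feat` (C, H, W) inside the region `box` = (x0, y0, x1, y1),
    given in feature-map coordinates. A simplified stand-in for RoI-Align."""
    C, H, W = feat.shape
    x0, y0, x1, y1 = box
    out = np.zeros((C, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Sample at the center of each output cell.
            y = y0 + (i + 0.5) * (y1 - y0) / out_size
            x = x0 + (j + 0.5) * (x1 - x0) / out_size
            yf, xf = int(np.floor(y)), int(np.floor(x))
            yc, xc = min(yf + 1, H - 1), min(xf + 1, W - 1)
            wy, wx = y - yf, x - xf
            # Bilinear interpolation over the four neighboring cells.
            out[:, i, j] = ((1 - wy) * (1 - wx) * feat[:, yf, xf]
                            + (1 - wy) * wx * feat[:, yf, xc]
                            + wy * (1 - wx) * feat[:, yc, xf]
                            + wy * wx * feat[:, yc, xc])
    return out

# "Feature replay": keep the global image tokens and append RoI-aligned
# region tokens, so the language model sees both the region and its context.
global_feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # toy (C, H, W) map
region_tokens = roi_align(global_feat, box=(1.0, 1.0, 3.0, 3.0)).reshape(2, -1).T
global_tokens = global_feat.reshape(2, -1).T
sequence = np.concatenate([global_tokens, region_tokens], axis=0)
print(sequence.shape)  # 16 global tokens + 4 region tokens, 2 channels each
```

Multiple region prompts would simply contribute additional blocks of region tokens to the same sequence, which is what lets the model compare regions and reason about their interactions.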
Why it matters?
This work is important because it moves AI beyond simply describing what's in an image to actively understanding it and reasoning about the relationships between different parts. The results show GAR performs better than existing models on understanding image details and relationships, and even transfers well to understanding videos, suggesting a significant step forward in visual AI.
Abstract
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.