GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

2025-03-14

Summary

This paper introduces GroundingSuite, a system that helps AI models get better at matching words to specific parts of images by providing a huge automatically annotated training dataset and an evaluation benchmark.

What's the problem?

Current AI models struggle to accurately link phrases to exact regions in images because existing training data covers few object categories, has limited textual variety, and lacks high-quality annotations.

What's the solution?

The researchers built an automated annotation framework that uses multiple Vision-Language Model agents to quickly create millions of labeled examples (9.56 million referring expressions with matching segmentation masks), plus a curated 3,800-image benchmark to check how well models can match descriptions to image regions.

Why does it matter?

This improves AI tools for tasks like describing photos for visually impaired users or helping robots understand instructions tied to specific objects in their environment.

Abstract

Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results: specifically, a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., 4.5 times faster than GLaMM.
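The abstract reports results in cIoU and gIoU, the two standard metrics for referring segmentation. As a rough illustration of what these numbers mean (this sketch is not from the paper, and the function names are placeholders): cIoU is conventionally computed as total intersection pixels over total union pixels across the whole dataset, while gIoU averages the per-sample IoU so that small objects count as much as large ones.

```python
import numpy as np

def ciou(preds, gts):
    """Cumulative IoU: sum all intersection pixels and all union
    pixels over the dataset, then take one global ratio."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

def giou(preds, gts):
    """Generalized (mean per-sample) IoU: compute IoU for each
    predicted/ground-truth mask pair, then average."""
    ious = [np.logical_and(p, g).sum() / np.logical_or(p, g).sum()
            for p, g in zip(preds, gts)]
    return float(np.mean(ious))

# Toy example: two binary masks per list.
pred = [np.array([[1, 1], [0, 0]], bool), np.array([[1, 0], [0, 0]], bool)]
gt   = [np.array([[1, 0], [0, 0]], bool), np.array([[1, 0], [0, 0]], bool)]
```

Because cIoU weights samples by pixel count, a model that does well only on large objects can score higher on cIoU than on gIoU; reporting both, as the paper does, gives a more complete picture.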