GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo

2025-12-22

Summary

This paper investigates how well current artificial intelligence models, specifically those that combine language and vision, can actually 'understand' what they're looking at when given instructions in natural language. It questions whether these models are truly grounding language in the visual world or just recognizing patterns in simplified tests.

What's the problem?

Existing tests for visual grounding aren't challenging enough to reveal whether AI models genuinely understand visual scenes like humans do. Humans can easily handle ambiguous descriptions, recognize when something *can't* be found in an image, and distinguish between very similar objects. Current benchmarks don't test these abilities, leading to an overestimation of AI capabilities and potential safety issues when these models are used in real-world applications.

What's the solution?

The researchers created a new, more difficult benchmark called GroundingME with 1,005 carefully curated examples. The benchmark tests models on four key areas: telling apart highly similar objects, understanding spatial relationships, dealing with partially hidden or tiny objects, and knowing when a request is impossible to fulfill. They evaluated 25 state-of-the-art models and found that all of them performed poorly, especially at recognizing when an object wasn't present; instead of acknowledging its absence, models typically hallucinated an answer. They then explored two ways to improve performance: test-time scaling, where the model generates several candidate answers and selects the one best supported by its reasoning, and data-mixture training, where examples with impossible requests are mixed into training so the model learns to reject them.
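The summary doesn't spell out how the test-time scaling step works internally, but it follows the familiar best-of-N pattern. Below is a minimal sketch in Python, assuming a hypothetical `model.ground()` interface that returns a reasoning trace plus a bounding box (or `None` for a rejection); the consistency-based scorer is illustrative, not the authors' actual selection rule.

```python
from typing import Any, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical interface: model.ground(image, query, temperature) returns
# {"reasoning": str, "box": Optional[Box]}, where box=None means the model
# rejected the query as ungroundable.

def sample_candidates(model: Any, image: Any, query: str, n: int = 8) -> List[Dict[str, Any]]:
    """Sample n independent grounding attempts at a non-zero temperature."""
    return [model.ground(image, query, temperature=0.8) for _ in range(n)]

def trajectory_score(candidate: Dict[str, Any]) -> float:
    """Illustrative scorer: reward answers whose final decision is consistent
    with the reasoning trace (an explicit rejection vs. a concrete box)."""
    reasoning = candidate["reasoning"].lower()
    rejects = "not present" in reasoning or "cannot find" in reasoning
    if candidate["box"] is None:
        return 1.0 if rejects else 0.0   # a rejection should be backed by the trace
    return 0.0 if rejects else 1.0       # a box should not contradict the trace

def best_of_n(model: Any, image: Any, query: str, n: int = 8) -> Dict[str, Any]:
    """Pick the candidate whose thinking trajectory best supports its answer."""
    return max(sample_candidates(model, image, query, n), key=trajectory_score)
```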

Why it matters?

This work is important because it shows that current AI models still have significant limitations in visual understanding. The new benchmark, GroundingME, provides a more realistic and rigorous way to evaluate these models and guides future research towards building AI systems that can truly understand and interact with the visual world in a safe and reliable manner.

Abstract

Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity, where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, which raises critical safety concerns for deployment. We explore two strategies for improvement: (1) test-time scaling, which selects the optimal response by its thinking trajectory, improving complex grounding by up to 2.9%, and (2) data-mixture training, which teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
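The abstract reports a single accuracy number across both groundable and ungroundable queries. A plausible scoring rule is sketched below; the IoU threshold and the box/rejection output format are assumptions, not the paper's stated protocol.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_correct(pred: Optional[Box], gt: Optional[Box], thresh: float = 0.5) -> bool:
    """A prediction counts as correct if it rejects an ungroundable query
    (both boxes are None), or if its box overlaps the ground truth above
    the IoU threshold; rejecting a groundable query is an error."""
    if gt is None:
        return pred is None
    if pred is None:
        return False
    return iou(pred, gt) >= thresh
```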