
TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

Wenhao Zhou, Hao Zheng, Rong Zhao

2025-11-19


Summary

This paper investigates how well large vision-language models (LVLMs) actually 'see' and understand images as a whole, rather than just recognizing objects within them.

What's the problem?

Current LVLMs combine image understanding with powerful language processing, but the image understanding part seems to be the bottleneck. Existing tests often contain easy shortcuts that let a model score well without truly understanding the overall scene, so those tests can overestimate how well a model grasps the global structure of an image.

What's the solution?

The researchers created a new test called TopoPerception. It focuses on the *topology* of an image, meaning how its parts are connected and arranged. Because topology depends on the global structure of an image and is unaffected by local details, a model can only answer correctly by perceiving the whole picture, leaving no shortcuts. When the researchers evaluated several state-of-the-art models on TopoPerception, every model performed no better than random chance, even at the coarsest level of detail. Surprisingly, bigger models with stronger reasoning did not do better, and within model families they often did worse.
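To make the idea concrete, here is a minimal Python sketch. It is not taken from the paper's benchmark; the helper name `count_components` and the toy 8x8 images are illustrative assumptions. It shows why a topological property such as the number of connected regions is a global feature: two images can contain the same foreground pixels arranged in the same blob shapes, yet differ in whether those blobs touch, so only a model that perceives the whole image can tell them apart.

```python
import numpy as np
from scipy import ndimage


def count_components(img):
    """Count 8-connected foreground regions: a global, topological property."""
    # np.ones((3, 3)) makes diagonal neighbors count as connected (8-connectivity).
    _, num = ndimage.label(img, structure=np.ones((3, 3)))
    return num


# Illustrative toy images (not the paper's actual stimuli).
# Both have exactly 8 foreground pixels in two 2x2 blobs.
two_blobs = np.zeros((8, 8), dtype=int)
two_blobs[1:3, 1:3] = 1   # blob A
two_blobs[5:7, 5:7] = 1   # blob B, far from A: two separate regions

one_blob = np.zeros((8, 8), dtype=int)
one_blob[1:3, 1:3] = 1
one_blob[3:5, 3:5] = 1    # same local content, but the blobs touch diagonally

print(count_components(two_blobs))  # 2 -- disconnected
print(count_components(one_blob))   # 1 -- connected
```

No patch of a few pixels reveals the answer on its own; the connectivity question is only decidable from the image as a whole, which is the kind of shortcut-free judgment TopoPerception asks models to make.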

Why it matters?

This research shows that current LVLMs have a fundamental weakness in understanding images globally. Simply making the models larger isn't solving the problem. It suggests that we need to rethink how these models are trained or even design new types of architectures to improve their ability to truly 'see' and interpret visual information, not just identify objects.

Abstract

Large Vision-Language Models (LVLMs) typically align visual features from an encoder with a pre-trained Large Language Model (LLM). However, this makes the visual perception module a bottleneck, which constrains the overall capabilities of LVLMs. Conventional evaluation benchmarks, while rich in visual semantics, often contain unavoidable local shortcuts that can lead to an overestimation of models' perceptual abilities. Here, we introduce TopoPerception, a benchmark that leverages topological properties to rigorously evaluate the global visual perception capabilities of LVLMs across various granularities. Since topology depends on the global structure of an image and is invariant to local features, TopoPerception enables a shortcut-free assessment of global perception, fundamentally distinguishing it from semantically rich tasks. We evaluate state-of-the-art models on TopoPerception and find that even at the coarsest perceptual granularity, all models perform no better than random chance, indicating a profound inability to perceive global visual features. Notably, a consistent trend emerges within model families: more powerful models with stronger reasoning capabilities exhibit lower accuracy. This suggests that merely scaling up models is insufficient to address this deficit and may even exacerbate it. Progress may require new training paradigms or architectures. TopoPerception not only exposes a critical bottleneck in current LVLMs but also offers a lens and direction for improving their global visual perception. The data and code are publicly available at: https://github.com/Wenhao-Zhou/TopoPerception.