
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng

2025-11-19


Summary

This paper introduces a new way to test how well Large Vision-Language Models (LVLMs), AI systems that understand both images and text, handle intentionally misleading images. The goal is to make sure these models remain reliable when they're used in the real world.

What's the problem?

Currently, most robustness tests for these models focus on tricking them with confusing text. However, it's just as important to see whether they can be fooled by misleading *images*. If a model misinterprets what's in a picture, it can give incorrect answers or take wrong actions even when the text is perfectly understood. Until now, there was no comprehensive test designed specifically to probe this visual weakness.

What's the solution?

The researchers created a benchmark called MVI-Bench, which contains 1,248 expertly annotated question-and-answer pairs about images. The images are designed to be misleading at three levels: visual concepts (the objects themselves), visual attributes (such as color or size), and visual relationships between objects, organized into six representative categories. They also developed MVI-Sensitivity, a new metric that measures *how* easily a model is fooled by these misleading images, giving a fine-grained view of each model's weaknesses.
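
To make the idea concrete, here is a minimal sketch of how such an evaluation could be scored. The summary does not spell out the exact MVI-Sensitivity formula, so the relative-accuracy-drop measure below, the `model.answer` interface, and the item fields (`clean_image`, `misleading_image`, `category`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: score an LVLM on paired "clean" vs. "misleading" VQA items
# and report a per-category sensitivity as the relative accuracy drop.
# Field names and the sensitivity formula are assumptions, not the paper's exact metric.
from collections import defaultdict

def evaluate_sensitivity(model, items):
    """items: list of dicts with keys
    'question', 'answer', 'clean_image', 'misleading_image', 'category'."""
    stats = defaultdict(lambda: {"clean": 0, "misleading": 0, "total": 0})

    for item in items:
        cat = item["category"]
        stats[cat]["total"] += 1
        # model.answer(image, question) -> str  (hypothetical LVLM interface)
        if model.answer(item["clean_image"], item["question"]) == item["answer"]:
            stats[cat]["clean"] += 1
        if model.answer(item["misleading_image"], item["question"]) == item["answer"]:
            stats[cat]["misleading"] += 1

    report = {}
    for cat, c in stats.items():
        acc_clean = c["clean"] / c["total"]
        acc_misleading = c["misleading"] / c["total"]
        # Sensitivity here = relative accuracy drop caused by the misleading input
        # (0 means fully robust, 1 means every originally correct answer was flipped).
        drop = (acc_clean - acc_misleading) / acc_clean if acc_clean > 0 else 0.0
        report[cat] = {
            "clean_acc": acc_clean,
            "misleading_acc": acc_misleading,
            "sensitivity": drop,
        }
    return report
```

Reporting the score per misleading category, as above, mirrors the paper's stated goal of characterizing robustness "at a granular level" rather than with a single aggregate accuracy number.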

Why it matters?

This work is important because it shows that even the most advanced LVLMs are surprisingly vulnerable to misleading images. By identifying these weaknesses, the researchers hope to guide the development of more robust and trustworthy AI systems that can accurately understand the visual world around them, which is crucial for applications like self-driving cars or medical image analysis.

Abstract

Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.