Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie
2025-11-07
Summary
This paper investigates a problem with how we test Multimodal Large Language Models (MLLMs), which are AI systems that can understand both text and images. The researchers found that these models often do well on tests not because they actually *understand* the images, but because they're exploiting tricks and patterns in the test questions themselves.
What's the problem?
Many benchmarks currently used to evaluate MLLMs do not truly test visual understanding. Models can achieve high scores by relying on linguistic priors, biases in the data, or superficial patterns in the questions, rather than actually 'seeing' and interpreting the visual information. This is a serious issue because it makes it hard to know whether these models are genuinely improving at processing visual data, especially on tests specifically designed to require visual input.
What's the solution?
The researchers propose a new way to design and improve these tests. They suggest that benchmark creators should actively try to 'beat' their own tests using only the text parts of the questions, which exposes weaknesses and biases before release. They developed two main tools: a 'Test-set Stress-Test', which trains a language model on just the text of the test questions to see how well it can do without ever looking at the images, and an 'Iterative Bias Pruning' method, which removes questions that are easily solved without visual understanding. They applied this framework to four existing benchmarks and used it to create a debiased version of one of them, called VSI-Bench-Debiased.
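To make the 'Test-set Stress-Test' idea concrete, here is a minimal sketch: cross-validate a text-only model on the test questions and record, for each question, how confidently it is answered without the image. The paper fine-tunes a powerful LLM for this step; the TF-IDF classifier below is only a lightweight stand-in so the sketch stays runnable, and the field names ('question', 'choices', 'answer') are illustrative assumptions rather than the benchmark's actual schema.

```python
# Sketch of a text-only stress test: score each test question by how well it can be
# answered from its text alone (no images). A TF-IDF + logistic-regression classifier
# stands in for the paper's fine-tuned LLM; the bias score here is the held-out
# probability the text-only model assigns to the correct answer.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline


def text_only_bias_scores(samples, n_splits=5, seed=0):
    """Return one bias score per sample from k-fold cross-validation on text only.

    Assumes each sample is a dict with 'question', 'choices', and 'answer' keys,
    and that each answer class appears at least n_splits times.
    """
    texts = [s["question"] + " " + " ".join(s["choices"]) for s in samples]
    labels = np.array([s["answer"] for s in samples])
    scores = np.zeros(len(samples))

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(texts, labels):
        # Train on the text of the other folds, then score the held-out fold.
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit([texts[i] for i in train_idx], labels[train_idx])
        proba = model.predict_proba([texts[i] for i in test_idx])
        class_index = {c: j for j, c in enumerate(model.classes_)}
        for row, i in zip(proba, test_idx):
            scores[i] = row[class_index[labels[i]]]
    return scores
```

In this sketch, a sample whose score is close to 1.0 is one the text-only model answers confidently without ever seeing the image, which flags it as a likely non-visual shortcut.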
Why does it matter?
This work is important because it highlights the need for more reliable ways to evaluate MLLMs. If we can't accurately measure a model's visual understanding, we can't be sure it is genuinely capable, or safe to use, in real-world applications. By identifying and removing biases in benchmarks, we can push developers to create models that genuinely 'see' and understand the world around them, rather than models that only appear to.
Abstract
Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" -- probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via k-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
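As a rough illustration of how an 'Iterative Bias Pruning' step could be wired to such a diagnostic, the sketch below repeatedly drops samples whose text-only bias score exceeds a threshold and then re-scores the remainder, since removing samples changes what a text-only model can exploit. The threshold, the scoring interface, and the stopping rule are assumptions chosen for illustration, not the paper's exact procedure.

```python
# Sketch of an iterative pruning loop: re-score after each round of removals because
# pruning changes the shortcut patterns available to the text-only diagnostic.
def iterative_bias_pruning(samples, score_fn, threshold=0.9, max_rounds=5):
    """Repeatedly remove samples the text-only diagnostic solves too easily.

    score_fn: callable mapping a list of samples to per-sample bias scores,
              e.g. the cross-validated text-only scorer sketched earlier.
    """
    kept = list(samples)
    for _ in range(max_rounds):
        scores = score_fn(kept)
        survivors = [s for s, score in zip(kept, scores) if score < threshold]
        if len(survivors) == len(kept):  # nothing pruned this round: stop
            break
        kept = survivors
    return kept


# Example usage (assuming a loaded list of test samples):
# debiased_set = iterative_bias_pruning(test_samples, text_only_bias_scores)
```

The loop stops either when a round prunes nothing or after a fixed number of rounds; a real debiasing pipeline would also track how the benchmark's size and answer distribution shift as samples are removed.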