Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu

2025-12-25

Summary

This paper investigates whether powerful vision-language models (VLMs) actually *understand* what they're looking at, or if they're just really good at remembering famous things. The research shows these models are surprisingly reliant on memorization, performing much better on well-known buildings than on ordinary ones.

What's the problem?

Current VLMs, while impressive, might be succeeding by simply recognizing and recalling information about popular subjects instead of truly understanding visual concepts. That means they may fail when shown something they haven't 'seen' during training, revealing a lack of genuine reasoning ability. Until now there was no systematic way to test this weakness, because existing datasets weren't large enough, or annotated in enough detail, to separate memorization from real understanding.

What's the solution?

The researchers created a new, large dataset called YearGuessr, containing 55,546 building images from 157 countries. Each image is labeled with the building's construction year (anywhere from 1001 to 2024), its GPS location, and a popularity score based on how many times its webpage has been viewed. They then used this dataset to test how well more than 30 VLMs could predict a building's construction year, checking whether performance changed with the building's popularity. They also introduced popularity-aware interval accuracy metrics, new ways of measuring accuracy that explicitly quantify this popularity bias.
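The summary doesn't spell out the metric, but the idea can be sketched in a few lines. Below is a minimal, hypothetical Python illustration, not the authors' implementation: a prediction counts as correct if it falls within ±k years of the true construction year, and that interval accuracy is reported separately for popular and unpopular buildings, using page-view counts as the popularity proxy. The ±10-year tolerance and the median split are assumptions made purely for illustration.

```python
import numpy as np

def interval_accuracy(pred_years, true_years, tolerance=10):
    """Fraction of predictions within ±tolerance years of the true construction year."""
    pred_years = np.asarray(pred_years, dtype=float)
    true_years = np.asarray(true_years, dtype=float)
    return float(np.mean(np.abs(pred_years - true_years) <= tolerance))

def popularity_aware_accuracy(pred_years, true_years, page_views, tolerance=10, split=None):
    """Interval accuracy stratified by popularity (page views as a proxy).

    `split` is the page-view threshold separating popular from unpopular
    buildings; a median split is assumed here for illustration only.
    """
    pred_years = np.asarray(pred_years, dtype=float)
    true_years = np.asarray(true_years, dtype=float)
    page_views = np.asarray(page_views, dtype=float)
    if split is None:
        split = np.median(page_views)
    popular = page_views >= split
    acc_pop = interval_accuracy(pred_years[popular], true_years[popular], tolerance)
    acc_unpop = interval_accuracy(pred_years[~popular], true_years[~popular], tolerance)
    return {"popular": acc_pop, "unpopular": acc_unpop, "gap": acc_pop - acc_unpop}
```

A large positive `gap` is exactly the kind of popularity bias the paper describes: the model is accurate on famous buildings it has likely memorized and much weaker on ordinary ones.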

Why it matters?

This research is important because it highlights a significant limitation of current VLMs. If these models can't generalize beyond memorized information, their usefulness in real-world applications – where they'll inevitably encounter unfamiliar situations – is limited. Understanding this bias is crucial for developing more robust and truly intelligent AI systems that can reason and understand the world around them, rather than just recognizing what they've already 'learned'.

Abstract

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
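As a point of reference for how "construction year prediction as ordinal regression" can be set up, here is a generic sketch, assumed for illustration rather than taken from YearCLIP: years in 1001-2024 are binned into ordered intervals, and each sample gets cumulative binary targets of the form "is the year later than bin k?", the standard extended-binary encoding of ordinal regression. The 25-year bin width is an assumption.

```python
import numpy as np

# Ordered 25-year bins covering 1001-2024 (bin width is an assumption for this sketch).
BIN_EDGES = np.arange(1001, 2025, 25)

def year_to_ordinal_targets(year, bin_edges=BIN_EDGES):
    """Map a construction year to cumulative binary targets over ordered bins."""
    bin_idx = np.searchsorted(bin_edges, year, side="right") - 1
    num_bins = len(bin_edges)
    # Target k is 1 if the true bin is strictly greater than bin k.
    return (np.arange(num_bins - 1) < bin_idx).astype(np.float32)

def ordinal_probs_to_year(probs, bin_edges=BIN_EDGES):
    """Decode predicted cumulative probabilities back to a year estimate (bin lower edge)."""
    bin_idx = int(np.sum(np.asarray(probs) > 0.5))
    return int(bin_edges[min(bin_idx, len(bin_edges) - 1)])
```

This encoding preserves the ordering of years, so a prediction that is off by one bin is penalized less than one that is off by several centuries, which is the point of treating the task as ordinal regression rather than flat classification.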