VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu
2025-03-12
Summary
This paper introduces VisualSimpleQA, a benchmark designed to measure how well AI models that understand both images and text answer fact-based questions, by evaluating their vision and language skills separately.
What's the problem?
Current benchmarks for AI models that handle images and text together don't reveal whether mistakes come from the model's ability to 'see' images or to 'understand' language, making it hard to diagnose and fix specific weaknesses.
What's the solution?
VisualSimpleQA splits the evaluation into parts: one checks how well models answer questions using only text, and another measures how much performance drops when the question requires recognizing content in an image. It also defines clear difficulty criteria, which are used to build an extra-hard subset that pushes models to their limits.
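The decoupled comparison described above can be sketched as follows. This is a hypothetical illustration, not code from the VisualSimpleQA release: the function names, the exact-match scoring rule, and the sample answers are all assumptions made for clarity.

```python
# Hypothetical sketch of decoupled evaluation: compare a model's accuracy
# on text-only questions with its accuracy on the corresponding
# multimodal (image + question) versions of the same items.

def accuracy(predictions, ground_truths):
    """Fraction of predictions that exactly match the gold answer
    (case- and whitespace-insensitive; a simplification of real grading)."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

def relative_degradation(text_only_acc, multimodal_acc):
    """How much accuracy drops when visual recognition is required,
    relative to the text-only baseline (0 = no drop)."""
    return (text_only_acc - multimodal_acc) / text_only_acc

# Illustrative toy data (not from the benchmark).
gold = ["paris", "einstein"]
text_preds = ["Paris", "Einstein"]  # answers to text-only questions
mm_preds = ["Paris", "Newton"]      # answers when an image must be read

acc_text = accuracy(text_preds, gold)        # 1.0
acc_mm = accuracy(mm_preds, gold)            # 0.5
rd = relative_degradation(acc_text, acc_mm)  # 0.5
```

A large relative degradation would point to the visual module as the bottleneck, while low text-only accuracy would implicate the model's linguistic knowledge.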
Why does it matter?
This lets developers improve an AI model's vision and language abilities separately, making models more reliable in tasks like medical image analysis or educational tools, where factual accuracy matters.
Abstract
Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.