Measuring Epistemic Humility in Multimodal Large Language Models
Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
2025-09-16

Summary
This paper introduces a new way to test how well AI models that work with both images and text, known as multimodal large language models (MLLMs), avoid making things up. These models sometimes 'hallucinate,' meaning they generate answers that don't actually match what's in the image, which can be dangerous in real-world applications.
What's the problem?
Current tests for these AI models mostly check whether they can pick the *right* answer from a set of choices. However, a truly trustworthy AI should also be able to say 'none of the above' when *none* of the answers is correct. Existing benchmarks don't measure this ability to recognize when no option fits, which is important for safety and reliability.
What's the solution?
The researchers created a new test called HumbleBench. They started with images annotated with detailed scene graphs (which objects appear, what their attributes are, and how they relate to each other), then used another AI (GPT-4-Turbo) to generate multiple-choice questions about those images. Crucially, each question includes a 'None of the above' option. The researchers then manually filtered the questions to remove flawed or ambiguous ones. The test specifically checks whether an AI can reject answers that sound plausible but are actually wrong, across three kinds of hallucination: objects, relations, and attributes.
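To make the setup concrete, here is a minimal sketch of what a HumbleBench-style question and its scoring might look like. The field names, the option layout, and the fixed letter used for 'None of the above' are illustrative assumptions, not the actual schema released by the authors.

```python
# Illustrative sketch only: the field names and structure below are assumptions,
# not the actual HumbleBench schema released by the authors.
from dataclasses import dataclass

@dataclass
class HumbleBenchItem:
    image_path: str
    question: str            # e.g. "Which object is on the table?"
    options: dict            # option letter -> answer text, one being "None of the above"
    answer: str              # correct option letter; may point to "None of the above"
    hallucination_type: str  # "object", "relation", or "attribute"

def is_correct(item: HumbleBenchItem, model_choice: str) -> bool:
    """Credit the model only if it picks the exact correct letter,
    including the case where the correct answer is 'None of the above'."""
    return model_choice.strip().upper() == item.answer

# Example: every listed object is absent from the image, so "None of the above" is correct.
item = HumbleBenchItem(
    image_path="example.jpg",
    question="Which object is on the table?",
    options={"A": "laptop", "B": "teapot", "C": "violin", "D": "None of the above"},
    answer="D",
    hallucination_type="object",
)
print(is_correct(item, "D"))  # True: the model rejected all plausible but wrong options
```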
Why does it matter?
This work is important because it provides a more realistic way to evaluate these AI models. Knowing whether an AI can admit when it's unsure is vital for using these models in situations where mistakes could be harmful, like medical diagnosis or self-driving cars. HumbleBench helps us build more reliable and safer AI systems, and the benchmark itself is publicly available for others to use and improve.
Abstract
Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. HumbleBench is built from a panoptic scene graph dataset: we leverage its fine-grained scene graph annotations to extract ground-truth entities and relations, prompt GPT-4-Turbo to generate multiple-choice questions, and apply a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
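The evaluation described in the abstract (credit a model only when it picks the listed correct answer, or correctly chooses "None of the above" when no option is valid) could be summarized in a short loop like the one below. This is an illustrative sketch, not the authors' released evaluation code: query_mllm is a hypothetical stand-in for the benchmarked model's interface, and the split into 'recognition' versus 'rejection' accuracy is an assumed reporting choice.

```python
# Minimal evaluation sketch, not the authors' released code. `query_mllm` is a
# hypothetical callable that takes (image_path, question, options) and returns a
# single option letter such as "A". `items` is any iterable of question records
# with .image_path, .question, .options, and .answer fields, as sketched above.
from collections import defaultdict

def evaluate(items, query_mllm, nota_letter="D"):
    """Report accuracy overall and split by whether the ground-truth answer is
    'None of the above' (rejection) or a listed option (recognition).
    Assumes 'None of the above' always sits at a fixed letter (an assumption)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = "rejection" if item.answer == nota_letter else "recognition"
        prediction = query_mllm(item.image_path, item.question, item.options)
        for key in (group, "overall"):
            total[key] += 1
            if prediction.strip().upper() == item.answer:
                correct[key] += 1
    return {key: correct[key] / total[key] for key in total if total[key] > 0}
```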