CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu

2025-12-01

Summary

This paper introduces a new way to test how good image captions are, not just by how well they *read*, but by how useful they are for actually doing things like answering questions or helping a computer understand what's in a picture.

What's the problem?

Currently, we evaluate image captions based on how well they describe the image, often using metrics that focus on language quality. However, no one really checks if those captions actually contain all the important information needed to perform tasks that rely on understanding the image. A caption might sound good, but if it misses key details, it's not very helpful for a computer trying to use it.

What's the solution?

The researchers created a benchmark called CaptionQA. This benchmark includes a huge collection of multiple-choice questions about images, but the trick is that the questions must be answered using only the image's caption – the model answering them never sees the image itself. They tested how well different AI models could answer these questions from captions alone, which reveals how much useful information is actually *in* those captions. The benchmark covers four domains – everyday natural scenes, documents, e-commerce product listings, and embodied AI scenes that robots must navigate.
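The core idea of the protocol can be sketched in a few lines. This is an illustrative mock-up, not the paper's actual code: `keyword_answerer` is a toy stand-in for the judge LLM, and the two example items are invented to show how a caption that omits a detail drags utility down.

```python
import re

def keyword_answerer(caption, question, options):
    """Toy stand-in for the judge LLM: pick the option whose words
    overlap most with the caption (the real protocol uses an LLM)."""
    tokens = set(re.findall(r"\w+", caption.lower()))
    return max(options, key=lambda opt: len(set(re.findall(r"\w+", opt.lower())) & tokens))

def caption_utility(items, answer_fn):
    """Caption utility = fraction of multiple-choice questions answered
    correctly using only the caption, never the image."""
    correct = sum(
        answer_fn(it["caption"], it["question"], it["options"]) == it["answer"]
        for it in items
    )
    return correct / len(items)

# Hypothetical benchmark items (not from CaptionQA itself).
items = [
    {"caption": "A red bicycle leaning against a brick wall.",
     "question": "What color is the bicycle?",
     "options": ["a red bicycle", "a blue car"],
     "answer": "a red bicycle"},
    {"caption": "A bicycle near a brick wall.",  # caption omits the color
     "question": "What color is the bicycle?",
     "options": ["red like the brick", "blue"],
     "answer": "blue"},  # ground truth from the image, unrecoverable from text
]

print(caption_utility(items, keyword_answerer))  # → 0.5
```

The second item is the failure mode the benchmark is designed to expose: the caption reads fine, but because it drops the color, no downstream model can recover the answer from text alone.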

Why it matters?

This work is important because it shows that current AI models, even those that are good at understanding both images and text together, often create captions that lose important visual details. This means that while the models seem to be doing well on standard tests, they aren't actually creating captions that are truly useful for real-world applications where a computer needs to 'see' and understand the world through text alone. It highlights a gap in how we evaluate captions and pushes for better captioning methods.

Abstract

Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well a caption supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks differ by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.