
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan

2025-03-21

Summary

This paper is about making AI better at describing images with lots of detail.

What's the problem?

It's hard to tell whether an AI is describing all the important details in an image, because the existing ways of measuring this rely on outdated metrics and coarse annotations, and the AI can hallucinate details that aren't actually there.

What's the solution?

The researchers created a new benchmark (DeCapBench) and a new metric (DCScore) that breaks a description into its smallest self-contained pieces of information and checks each one, so it can tell whether the AI covers the important details without making things up. They also developed a feedback method (FeedQuill) that uses this metric to automatically collect preference data, letting the AI learn to improve its descriptions.
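The unit-by-unit checking idea can be illustrated with a minimal sketch. The real DCScore uses a model to extract and verify "primitive information units" from a caption; here both steps are stubbed with hand-written unit sets and simple set matching, so every name and value below is an illustrative assumption, not the paper's implementation.

```python
def dcscore_sketch(caption_units, reference_units):
    """Score a caption via its extracted information units.

    Precision penalizes hallucinated units (claims not supported by the
    image); recall rewards fine-grained comprehensiveness. Extracting and
    verifying units is assumed to have happened already, not shown here.
    """
    caption_units = set(caption_units)
    reference_units = set(reference_units)
    verified = caption_units & reference_units  # units supported by the image
    precision = len(verified) / len(caption_units) if caption_units else 0.0
    recall = len(verified) / len(reference_units) if reference_units else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical units for one image
caption = {"a dog", "the dog is brown", "the dog wears a red collar"}
reference = {"a dog", "the dog is brown", "grass in the background"}
print(dcscore_sketch(caption, reference))
```

The key design point is that scoring happens at the level of atomic claims rather than whole sentences, so one hallucinated detail lowers the score without masking the details the caption got right.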

Why it matters?

This work matters because it can lead to AI that describes images accurately and in rich detail; in the paper's experiments, the approach reduces hallucinations and even surpasses GPT-4o on detailed captioning.

Abstract

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
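The FeedQuill idea of turning an automatic metric into preference data can be sketched as follows. This is a hedged illustration, not the paper's pipeline: the scorer is a stand-in for a DCScore-style metric, and the function name, margin, and toy captions are all assumptions.

```python
def collect_preference_pair(candidates, score, margin=0.1):
    """Turn sampled captions plus an automatic scorer into one
    (chosen, rejected) pair for preference optimization (e.g. DPO)."""
    scored = sorted(candidates, key=score, reverse=True)
    best, worst = scored[0], scored[-1]
    if score(best) - score(worst) < margin:
        return None  # skip ambiguous pairs with too small a score gap
    return {"chosen": best, "rejected": worst}

# Toy scorer: rewards longer captions (stand-in for a detail-aware metric)
pair = collect_preference_pair(
    ["a dog", "a brown dog with a red collar on grass", "a dog outside"],
    score=lambda c: len(c.split()) / 10,
)
print(pair)
```

Because the preference labels come from the metric rather than human annotators, data of this shape can be generated automatically at scale, which is what lets the feedback generalize across auto-generated preference data.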