FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo

2024-11-27

Summary

This paper introduces FINECAPTION, a new model designed to help computers create detailed captions for images by understanding specific regions within those images at different levels of detail.

What's the problem?

Current Vision-Language Models (VLMs) have trouble accurately describing specific parts of images because they struggle to align segmentation masks with the correct meanings. As a result, they find it hard to produce detailed, coherent captions that capture the compositional details of what is shown in a particular region of an image.

What's the solution?

To solve this problem, the authors developed FINECAPTION, a model that accepts user-defined masks marking arbitrary regions of a high-resolution image and generates captions describing those regions in detail, at the level of granularity the user asks for. They also created a new dataset called COMPOSITIONCAP, which pairs image regions with attribute-aware descriptions, so the model learns to provide rich, informative captions covering the various attributes of the objects in each region.
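
To make the idea of mask-referential captioning concrete, here is a minimal illustrative sketch (not the authors' implementation): a request can be thought of as an image, a binary mask picking out the region of interest, and a prompt naming the attribute to describe. The `RegionCaptionRequest` class and its fields are hypothetical names used only for illustration.

```python
# Minimal sketch (not FINECAPTION's actual code): bundling an image, a
# user-defined binary mask, and an attribute-focused prompt into one request
# for a mask-aware captioning model. All names here are hypothetical.

import numpy as np


class RegionCaptionRequest:
    """Holds an image, a binary region mask, and the attribute to describe."""

    def __init__(self, image: np.ndarray, mask: np.ndarray, attribute: str):
        if image.shape[:2] != mask.shape:
            raise ValueError("mask must match the image's height and width")
        self.image = image                 # H x W x 3 RGB array
        self.mask = mask.astype(bool)      # H x W binary mask selecting the region
        self.attribute = attribute         # which compositional aspect to describe

    def prompt(self) -> str:
        # The region is given as a mask rather than a bounding box, so
        # arbitrarily shaped regions can be referred to.
        return (f"Describe the {self.attribute} of the object covered by "
                f"the given mask in one detailed sentence.")


if __name__ == "__main__":
    # Toy example: a 512x512 image with a rectangular region selected.
    image = np.zeros((512, 512, 3), dtype=np.uint8)
    mask = np.zeros((512, 512), dtype=np.uint8)
    mask[100:300, 150:400] = 1

    request = RegionCaptionRequest(image, mask, attribute="color")
    print(request.prompt())
```

In this framing, changing the attribute or swapping in a different mask changes what the caption should focus on, which is the "any granularity, wherever you want" behavior the paper targets.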

Why it matters?

This work is important because it improves how machines understand and describe images, which can enhance applications like image search engines, automated content creation, and assistive technologies. By enabling more accurate and nuanced image captioning, FINECAPTION helps bridge the gap between visual information and language understanding.

Abstract

The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle with fine-grained image regional composition information perception. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. However, compositionality - the ability to understand and generate novel combinations of known visual and textual components - is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different granularity levels. To support this endeavor, we introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning. Empirical results demonstrate the effectiveness of our proposed model compared to other state-of-the-art VLMs. Additionally, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.
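
To give a sense of what the attribute-aware regional captioning task involves, the following is a hypothetical sketch of what a single training record for such a task could contain. The field names and example values are assumptions for illustration only, not the actual COMPOSITIONCAP schema.

```python
# Hypothetical sketch of one record for attribute-aware regional image
# captioning, in the spirit of the task described in the abstract. The field
# names and example values are illustrative assumptions, not the released
# COMPOSITIONCAP format.

from dataclasses import dataclass


@dataclass
class RegionalCaptionExample:
    image_path: str    # path to the source image
    mask_path: str     # path to the binary mask selecting the referred region
    attribute: str     # compositional aspect the caption should focus on
    caption: str       # ground-truth caption for that region and attribute


example = RegionalCaptionExample(
    image_path="images/000123.jpg",
    mask_path="masks/000123_region2.png",
    attribute="material",
    caption="A wooden chair with a woven seat, placed beside the window.",
)
print(example.attribute, "->", example.caption)
```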