
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen

2025-02-05

Summary

This paper introduces the COCONut-PanCap dataset, which was created to improve how AI models understand and describe images. It pairs detailed image captions with panoptic segmentation masks, which label every region of an image, making the descriptions more accurate and comprehensive.

What's the problem?

Existing datasets for image-to-text tasks often lack detailed, scene-comprehensive descriptions, making it harder for AI models to fully understand images and generate accurate captions. Current annotation methods either demand extensive human effort or sacrifice quality and scalability.

What's the solution?

The researchers built the COCONut-PanCap dataset by starting from the advanced COCONut panoptic masks (built on COCO images) and adding fine-grained, region-level captions. These captions were generated with the help of AI and then refined by humans to ensure accuracy. This dataset allows AI models to better learn how to describe images in detail and to connect specific parts of an image to their descriptions, as the sketch below illustrates.
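To make the idea of grounded, region-level captions concrete, here is a minimal Python sketch of what a single annotation record might look like. The field names, classes, and values here are illustrative assumptions, not the dataset's actual schema or API.

```python
# Hypothetical sketch of a COCONut-PanCap-style record. Field names and
# structure are illustrative assumptions, not the dataset's real schema.
from dataclasses import dataclass, field


@dataclass
class RegionCaption:
    segment_id: int   # id of a segment in the panoptic mask
    category: str     # panoptic class, e.g. "person" or "sky"
    caption: str      # fine-grained description of that region


@dataclass
class PanCapRecord:
    image_id: str                 # COCO image identifier
    panoptic_mask: str            # path to the panoptic mask PNG
    dense_caption: str            # human-edited scene-level caption
    region_captions: list = field(default_factory=list)


# "Grounding" means each phrase in the dense caption can be traced back
# to a specific segment in the panoptic mask.
record = PanCapRecord(
    image_id="000000139",
    panoptic_mask="panoptic/000000139.png",
    dense_caption=(
        "A woman in a red coat walks a small dog along a snowy "
        "sidewalk lined with bare trees."
    ),
    region_captions=[
        RegionCaption(1, "person", "a woman wearing a long red coat"),
        RegionCaption(2, "dog", "a small brown dog on a leash"),
        RegionCaption(3, "pavement", "a sidewalk dusted with fresh snow"),
    ],
)

for rc in record.region_captions:
    print(f"segment {rc.segment_id} ({rc.category}): {rc.caption}")
```

The key design point this sketch highlights is the one-to-one link between mask segments and caption phrases, which is what lets models trained on the dataset tie language to precise image regions.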

Why it matters?

This work is important because it sets a new standard for training AI models that need to understand and describe images. By providing high-quality, detailed annotations, COCONut-PanCap improves both image understanding and text-to-image generation tasks, making AI more effective in real-world applications like visual question answering and content creation.

Abstract

This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.