Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi
2024-07-10

Summary
This paper introduces graph-based captioning (GBC), a new method that describes images with a graph structure rather than a single block of text. Instead of giving one flat description, GBC describes different parts of the image and connects them in a way that reflects their relationships and compositions.
What's the problem?
The main problem is that existing image-description datasets rely on plain text, which limits how much richness and detail a caption can carry. Plain captions do not capture the relationships and compositions among the objects in an image, which makes it harder for AI models to learn and generate detailed, compositional visual descriptions.
What's the solution?
To solve this issue, the authors propose GBC, which organizes image descriptions into a graph. In a first stage, entity nodes are uncovered by recursively applying object detection and dense captioning tools; a second stage then links them through composition and relation nodes that describe how the entities fit together. Because every node still holds a plain-text caption, GBC keeps the flexibility of natural language while adding structured information that helps AI models understand complex scenes. The authors run this pipeline automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, to build GBC10M, a dataset of graph annotations for about 10 million images.
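To make the idea concrete, the sketch below shows one way such a graph could be represented in Python: every node carries a plain-text caption, and edges link the image node to entity and relation nodes. The class names, node kinds, field names, and the toy example are illustrative assumptions for this summary, not the authors' released schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Minimal sketch of a GBC-style annotation (assumed structure, not the official format).

@dataclass
class GBCNode:
    node_id: str
    kind: str                      # "image", "entity", "composition", or "relation"
    caption: str                   # every node holds a plain-text description
    children: List[str] = field(default_factory=list)  # ids of nodes this one points to

@dataclass
class GBCGraph:
    nodes: Dict[str, GBCNode] = field(default_factory=dict)

    def add(self, node: GBCNode) -> None:
        self.nodes[node.node_id] = node

    def captions(self) -> List[str]:
        """Flatten the graph into its pool of node captions (e.g. as extra text for CLIP training)."""
        return [n.caption for n in self.nodes.values()]

# Toy example: an image of a dog chasing a ball on a lawn.
graph = GBCGraph()
graph.add(GBCNode("img", "image", "A dog chases a red ball across a lawn.", ["dog", "ball", "rel-1"]))
graph.add(GBCNode("dog", "entity", "A brown dog in mid-run."))
graph.add(GBCNode("ball", "entity", "A small red rubber ball."))
graph.add(GBCNode("rel-1", "relation", "The dog is chasing the ball.", ["dog", "ball"]))
print(graph.captions())
```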
Why it matters?
This research is important because it enhances the way AI systems interpret and generate descriptions of images. By using a graph-based approach, GBC allows for more detailed and interconnected descriptions, which can improve the performance of various AI applications like image retrieval, visual understanding, and even text-to-image generation. This advancement can lead to better tools for education, accessibility, and creative industries.
Abstract
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not reflected yet in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labelled graph structure, with nodes of various types. The nodes in GBC are created using, in a first stage, object detection and dense captioning tools nested recursively to uncover and describe entity nodes, further linked together in a second stage by highlighting, using new types of nodes, compositions and relations among entities. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, by building a new dataset, GBC10M, gathering GBC annotations for about 10M images of the CC12M dataset. We use GBC10M to showcase the wealth of node captions uncovered by GBC, as measured with CLIP training. We show that using GBC nodes' annotations -- notably those stored in composition and relation nodes -- results in a significant performance boost on downstream models when compared to other dataset formats. To further explore the opportunities provided by GBC, we also propose a new attention mechanism that can leverage the entire GBC graph, with encouraging experimental results that show the extra benefits of incorporating the graph structure. Our datasets are released at https://huggingface.co/graph-based-captions.
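For readers who want to explore the released data, the snippet below is a minimal loading sketch using the Hugging Face datasets library. The repository id "graph-based-captions/GBC10M" and the available fields are assumptions based on the organization link above; check the release page for the actual layout.

```python
from datasets import load_dataset

# Stream the dataset so nothing needs to be downloaded up front.
# NOTE: the repo id below is an assumption; see https://huggingface.co/graph-based-captions.
ds = load_dataset("graph-based-captions/GBC10M", split="train", streaming=True)

# Inspect the first record to see which fields (image URL, node captions,
# graph structure) the release actually provides.
first = next(iter(ds))
print(first.keys())
```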