Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan
2024-12-10

Summary
This paper introduces MMGiC, a new dataset and approach that improves how Multimodal Large Language Models (MLLMs) understand and generate multimodal content by combining coarse-grained and fine-grained concept annotations.
What's the problem?
Current MLLMs are pre-trained mainly on coarse-grained annotations such as image captions, which often lack the detail needed for complex tasks. This missing detail can limit a model's ability to understand and generate content accurately, especially when images contain many diverse objects and contexts.
What's the solution?
The authors introduce the MMGiC dataset, which pairs coarse-grained annotations (such as image captions) with fine-grained annotations (such as object labels and their locations in the image). They study how these different granularities of data can work together, exploring several data recipes under a structured template and a general MLLM framework (see the sketch below for what a multi-grained record might look like). Their experiments show that combining the two granularities leads to better multimodal understanding and generation, with clear improvements across a range of benchmarks.
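To make the idea concrete, here is a minimal sketch of what a multi-grained annotation record could look like and how both granularities might be interleaved into one training string. The field names, bounding-box format, and template are illustrative assumptions, not the dataset's actual schema or the paper's exact structured template.

```python
# Minimal sketch (not the official MMGiC schema): one record that combines a
# coarse-grained caption with fine-grained object labels and regions.
record = {
    "image": "example.jpg",                       # hypothetical image path
    "caption": "A dog chasing a ball in a park",  # coarse-grained annotation
    "objects": [                                  # fine-grained annotations
        {"label": "dog",  "bbox": [0.12, 0.40, 0.55, 0.90]},  # normalized x1, y1, x2, y2
        {"label": "ball", "bbox": [0.60, 0.70, 0.72, 0.82]},
    ],
}

def to_training_text(rec):
    """Interleave both annotation granularities into a single structured string."""
    obj_parts = [f"{o['label']} at {o['bbox']}" for o in rec["objects"]]
    return f"Caption: {rec['caption']}. Objects: " + "; ".join(obj_parts) + "."

print(to_training_text(record))
```

The point of such a template is that the caption gives conceptual breadth while the labeled regions give depth, so the model can ground each concept it reads about to a specific part of the image.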
Why it matters?
This research is important because it helps improve AI's ability to process and understand complex visual information. By using a more detailed approach to annotations, MMGiC can lead to advancements in applications such as image recognition, automated content creation, and more effective human-computer interactions. This ultimately enhances the overall performance of AI systems in real-world tasks.
Abstract
Multimodal Large Language Models (MLLMs) excel in vision--language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We clearly explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image--caption data on 12 multimodal comprehension and generation benchmarks, e.g., their appropriate combination achieves 3.95% and 2.34% absolute improvements over image--caption data alone on POPE and SEED-Bench. Code, data and models will be available at https://github.com/LooperXX/MMGiC.
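As a rough illustration of the data-recipe idea described in the abstract, the sketch below samples a training mix from MMGiC-style multi-grained records and plain image--caption records. The mixing ratio, record fields, and helper function are assumptions made for illustration only; they are not the paper's actual recipe or code.

```python
import random

# Minimal sketch (assumed mixing strategy): combine MMGiC-style records,
# which carry object labels and regions, with plain image--caption records.
mmgic_data = [
    {"caption": "A dog in a park",
     "objects": [{"label": "dog", "bbox": [0.1, 0.4, 0.5, 0.9]}]},
]
caption_data = [
    {"caption": "A cat on a sofa", "objects": []},
]

def build_training_mix(mmgic, captions, mmgic_ratio=0.5, size=10, seed=0):
    """Sample a training mix with a given proportion of multi-grained records."""
    rng = random.Random(seed)
    mix = []
    for _ in range(size):
        source = mmgic if rng.random() < mmgic_ratio else captions
        mix.append(rng.choice(source))
    return mix

print(len(build_training_mix(mmgic_data, caption_data)))
```

In the paper, it is this kind of combination of the two data sources, rather than either source alone, that yields the reported gains on benchmarks such as POPE and SEED-Bench.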