CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
2024-12-09

Summary
This paper introduces CompCap, a new approach to improving how Multimodal Large Language Models (MLLMs) understand and generate captions for composite images, which are created by combining multiple visual elements rather than captured directly by a camera.
What's the problem?
While MLLMs have made progress in interpreting natural images, they struggle with composite images (CIs), such as charts or posters, which are common in real-world applications. Current models often fail to extract accurate information from these images or reason over them reliably, because existing CI training data is mostly formatted as question-answer pairs and lacks the high-quality image-caption data needed for robust vision-language alignment.
What's the solution?
To address this gap, the authors introduce Composite Captions (CompCap), a framework that uses Large Language Models (LLMs) and automation tools to synthesize composite images paired with accurate, detailed captions. With it, they build CompCap-118K, a dataset of 118,000 image-caption pairs spanning six types of composite images, and then fine-tune MLLMs of three sizes (xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B) on this data to improve their understanding and captioning of composite images.
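To make the synthesis idea concrete, here is a minimal, hypothetical sketch of one such pipeline: stitching existing image-caption pairs into a collage-style composite image and deriving a structured caption from the per-panel captions. The function names and the template-based caption are illustrative assumptions; the actual framework relies on LLMs and automation tools and covers six CI types, including charts and posters.

```python
# Hypothetical sketch of composite-image + caption synthesis (not the paper's code).
from PIL import Image


def make_collage(images, cols=2, cell=(256, 256), pad=8, bg="white"):
    """Paste images onto a grid canvas and return the composite image."""
    rows = (len(images) + cols - 1) // cols
    width = cols * cell[0] + (cols + 1) * pad
    height = rows * cell[1] + (rows + 1) * pad
    canvas = Image.new("RGB", (width, height), bg)
    for i, img in enumerate(images):
        row, col = divmod(i, cols)
        x = pad + col * (cell[0] + pad)
        y = pad + row * (cell[1] + pad)
        canvas.paste(img.resize(cell), (x, y))
    return canvas


def compose_caption(captions, cols=2):
    """Build a detailed composite caption from per-panel captions.
    (In CompCap, an LLM would turn these into a fluent, detailed caption.)"""
    parts = []
    for i, cap in enumerate(captions):
        row, col = divmod(i, cols)
        parts.append(f"Panel at row {row + 1}, column {col + 1}: {cap}.")
    return f"A {cols}-column collage of {len(captions)} images. " + " ".join(parts)


if __name__ == "__main__":
    # Placeholder inputs; in practice these would come from a natural-image caption dataset.
    imgs = [Image.new("RGB", (300, 200), c) for c in ("red", "green", "blue", "gray")]
    caps = ["a red square", "a green square", "a blue square", "a gray square"]
    make_collage(imgs).save("composite.png")
    print(compose_caption(caps))
```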
Why it matters?
This research matters because it strengthens AI's ability to understand complex visuals that combine multiple elements. By improving how MLLMs handle composite images, CompCap can enable better applications in fields like education, data visualization, and content creation, making it easier for machines to interpret and describe information presented in complex visual formats.
Abstract
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
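As a rough illustration of how such image-caption pairs could feed into supervised fine-tuning, the sketch below converts one pair into a single-turn conversation sample. The JSON field names and the prompt wording are assumptions for illustration, not the released CompCap-118K schema or the exact templates used for xGen-MM or LLaVA-NeXT.

```python
# Hypothetical conversion of an image-caption pair into an SFT sample.
import json
import random

CAPTION_PROMPTS = [  # prompt variants so the model does not overfit to one instruction
    "Describe this image in detail.",
    "Provide a detailed caption for this image.",
]


def to_sft_sample(image_path: str, caption: str) -> dict:
    """Wrap one image-caption pair as a single-turn conversation record."""
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\n" + random.choice(CAPTION_PROMPTS)},
            {"from": "gpt", "value": caption},
        ],
    }


if __name__ == "__main__":
    sample = to_sft_sample("composite.png", "A 2-column collage of 4 images. ...")
    print(json.dumps(sample, indent=2))
```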