Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen

2024-08-09

Summary

This paper introduces Img-Diff, a new dataset designed to improve how multimodal models recognize and understand images by focusing on the differences between pairs of similar images.

What's the problem?

Multimodal Large Language Models (MLLMs) need high-quality data to learn to recognize images accurately. However, existing datasets often lack the detail needed for fine-grained image recognition, making it hard for these models to distinguish between similar objects across images. This limits their effectiveness in tasks like visual question answering and image analysis.

What's the solution?

The authors created Img-Diff, a dataset of pairs of similar images that differ in specific objects. They used the Stable-Diffusion-XL model and image editing techniques to generate these pairs and provide detailed captions explaining the differences. By fine-tuning MLLMs on this dataset, they achieved significant improvements in the models' ability to recognize subtle variations in images compared to using larger but less focused datasets. The authors also explored methods for generating additional types of image difference data, such as "object removal", to further extend the approach.
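To make the construction concrete, here is a minimal Python sketch of what one "object replacement" sample might look like and how a naive difference-area step could locate the changed region. The field names, the difference_bbox helper, and the pixel-difference heuristic are illustrative assumptions, not the authors' implementation (the paper's Difference Area Generator works on image pairs produced with Stable-Diffusion-XL).

# A hedged sketch of one Img-Diff-style "object replacement" sample and a
# naive difference-area step. Field names and the pixel-difference heuristic
# are illustrative assumptions, not the paper's released implementation.
from dataclasses import dataclass
from typing import Tuple

import numpy as np

@dataclass
class ImgDiffSample:
    image_a: np.ndarray                    # H x W x 3 original image
    image_b: np.ndarray                    # same scene with one object replaced
    diff_bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)
    diff_caption: str                      # e.g. "the cat was replaced with a dog"

def difference_bbox(image_a: np.ndarray, image_b: np.ndarray,
                    threshold: int = 30) -> Tuple[int, int, int, int]:
    """Bounding box of pixels whose largest per-channel difference exceeds threshold."""
    diff = np.abs(image_a.astype(np.int16) - image_b.astype(np.int16)).max(axis=-1)
    ys, xs = np.nonzero(diff > threshold)
    if len(xs) == 0:
        return (0, 0, 0, 0)  # images are (near-)identical
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(0, 256, (128, 128, 3), dtype=np.uint8)
    b = a.copy()
    b[40:80, 50:90] = 255  # simulate a replaced-object patch
    print(difference_bbox(a, b))  # expected: (50, 40, 89, 79)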

Why it matters?

This research is important because it provides a valuable resource for improving machine learning models that need to understand images better. By focusing on the differences between similar objects, Img-Diff helps advance the capabilities of AI in areas like computer vision, which can lead to better applications in fields such as healthcare, autonomous driving, and robotics.

Abstract

High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for producing detailed difference descriptions. The result is a relatively small but high-quality dataset of "object replacement" samples. We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements in performance scores over SOTA models trained on larger-scale datasets, across numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate alternative methods for generating image difference data through "object removal" and conduct a thorough evaluation to confirm the dataset's diversity, quality, and robustness, presenting several insights on the synthesis of such contrastive datasets. To encourage further research and advance the field of multimodal data synthesis and the enhancement of MLLMs' fundamental capabilities for image understanding, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.
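As a rough illustration of how such pairs could be packaged for MLLM fine-tuning, the sketch below builds one possible instruction-tuning record from an Img-Diff pair. The JSON keys, file names, and prompt wording are assumptions made for illustration and do not reflect the released data format.

# A hedged sketch of an instruction-tuning record built from an Img-Diff pair.
# The schema (keys, file names, prompt wording) is an illustrative assumption,
# not the released Img-Diff format.
import json

record = {
    "images": ["pair_0001_a.png", "pair_0001_b.png"],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<image>\nThese two images are nearly identical. "
                     "Describe the object that differs between them.",
        },
        {
            "from": "assistant",
            "value": "In the first image there is a red bicycle leaning against "
                     "the wall; in the second image it has been replaced with "
                     "a green scooter.",
        },
    ],
}

print(json.dumps(record, indent=2))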