Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang

2024-10-04

Summary

This paper examines how diverse, rewritten captions can be used to pre-train multimodal foundation models, which combine text and images, in order to improve their performance.

What's the problem?

Recent work has shown that rewritten captions can improve how well multimodal models understand images and text together, but challenges remain. For instance, it's unclear whether synthetic captions (captions generated by a model) can fully replace the original AltTexts (the web-crawled text descriptions that accompany images). Additionally, different models may prefer different types of captions, and little research has been done to find the best caption format for each model.

What's the solution?

To address these issues, the authors propose a controllable, scalable captioning pipeline that generates caption formats tailored to different multimodal models. They studied two types of synthetic captions, Short Synthetic Captions (SSC) and Dense Synthetic Captions (DSC+), and examined how these interact with the original AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Their experiments found that keeping both synthetic captions and AltTexts leads to better performance than using synthetic captions alone, and each model showed preferences for specific caption styles.
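
As a rough illustration (not the authors' actual pipeline), a hybrid strategy can be pictured as choosing, per training sample, between the noisy web AltText and one of the synthetic caption styles. The helper name `pick_caption`, the field names `alt_text`/`ssc`/`dsc_plus`, and the mixing ratio are all hypothetical.

```python
import random

def pick_caption(sample, p_alt=0.5, synthetic_style="ssc", rng=random):
    """Return one caption for a pre-training sample.

    Assumed (illustrative) fields in `sample`:
      "alt_text": the original web-crawled AltText
      "ssc":      a short synthetic caption
      "dsc_plus": a dense synthetic caption
    """
    # With probability p_alt, keep the noisy but diverse AltText.
    if rng.random() < p_alt and sample.get("alt_text"):
        return sample["alt_text"]
    # Otherwise use the rewritten caption in the requested style.
    key = "ssc" if synthetic_style == "ssc" else "dsc_plus"
    return sample[key]

# Example: the paper suggests the best mix is model-dependent, so one run
# (e.g. CLIP-style) might favor short captions while another favors dense ones.
example = {
    "alt_text": "dog photo 2021.jpg",
    "ssc": "A brown dog running on a beach.",
    "dsc_plus": "A brown short-haired dog runs along a sandy beach at sunset, "
                "kicking up sand, with waves breaking in the background.",
}
print(pick_caption(example, p_alt=0.5, synthetic_style="dsc_plus"))
```

The key design idea this sketch captures is that AltTexts are not simply discarded: some fraction of samples retain them, which the paper reports works better than synthetic captions alone.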

Why it matters?

This research is important because it helps improve how multimodal models learn from images and text. Its findings on how to optimize captioning strategies can lead to AI systems that understand content more accurately, which is crucial for applications such as image search engines, accessibility tools, and any technology that interprets visual and textual information together.

Abstract

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.