
From Pixels to Prose: A Large Dataset of Dense Image Captions

Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein

2024-06-18

Summary

This paper introduces PixelProse, a new dataset containing over 16 million detailed captions for images. It aims to improve the training of vision-language models, which are AI systems that understand both images and text.

What's the problem?

Training large vision-language models requires vast numbers of high-quality image-text pairs. However, many existing datasets scraped from the web are noisy and lack detailed descriptions of the images they contain. This shortage of quality data makes it difficult for AI models to learn effectively and to understand images in depth.

What's the solution?

To solve this problem, the authors created PixelProse, which consists of synthetically generated captions that provide accurate and detailed descriptions of images. They used advanced vision-language models to generate these captions and carefully checked the dataset for any harmful content, such as personal information or inappropriate material. They also included useful metadata, like whether an image has a watermark and its aesthetic quality, to help researchers filter the data further.
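
The paper does not prescribe a specific filtering recipe, but as a rough sketch, the released metadata could be used with the Hugging Face datasets library along these lines. Note that the split name and the column names used below (watermark_score, aesthetic_score, caption) are assumptions for illustration; the actual schema is documented on the PixelProse dataset card.

# Minimal sketch of metadata-based filtering with the Hugging Face datasets library.
# NOTE: the split name and column names ("watermark_score", "aesthetic_score",
# "caption") are assumptions made here for illustration; check the PixelProse
# dataset card for the real schema.
from datasets import load_dataset

# Stream the dataset to avoid downloading all 16M+ rows up front.
ds = load_dataset(
    "tomg-group-umd/pixelprose",
    split="train",          # assumed split name
    streaming=True,
)

def keep(example):
    # Keep images unlikely to contain a watermark and with a reasonable
    # aesthetic-quality score (thresholds chosen arbitrarily for illustration).
    return example["watermark_score"] < 0.5 and example["aesthetic_score"] > 5.0

filtered = ds.filter(keep)

# Peek at a few captions from the filtered stream.
for example in filtered.take(3):
    print(example["caption"][:120])

Because the metadata fields ship alongside each caption, this kind of filtering can be done on the fly while streaming, without materializing the full dataset on disk.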

Why it matters?

This research is important because it provides a high-quality resource for training AI models that need to understand images and text together. With a comprehensive dataset like PixelProse, researchers can improve the performance of vision-language models, leading to better image captioning, visual search, and other applications that depend on accurately interpreting visual information.

Abstract

Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose