Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai
2025-02-12
Summary
This paper explores training AI models that understand both images and text using an enormous dataset of 100 billion examples. The researchers wanted to see whether such a huge amount of data would make the AI better at various tasks, especially those involving different cultures and languages.
What's the problem?
Current AI models that work with images and text are mostly trained on data from Western countries, which means they might not understand or represent other cultures and languages very well. Also, researchers weren't sure if using more data would actually make these AIs better at their tasks.
What's the solution?
The researchers assembled a massive dataset of 100 billion image-text pairs from the internet, trained AI models on it, and tested them on a range of tasks. They found that while the AI didn't get much better at common Western-focused tasks, it improved a lot on tasks involving cultural diversity and less common languages. They also discovered that filtering the data for quality, which is usually thought to be helpful, can actually reduce the cultural diversity in the AI's knowledge.
Why it matters?
This research matters because it shows that to make AI systems that truly understand and represent the whole world, we need to use huge amounts of diverse data. It highlights that just making datasets bigger isn't enough; we need to make sure they include a wide range of cultures and languages. This could lead to AI that's more fair and useful for people from all backgrounds, not just those from Western countries.
Abstract
We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, culturally diverse tasks achieve more substantial gains from the 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pre-training dataset via quality filters such as CLIP, typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.
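The CLIP-based quality filtering the abstract refers to is, in essence, keeping only image-text pairs whose image and caption embeddings are similar. The sketch below is illustrative only: `filter_by_clip_score` and the stand-in random embeddings are hypothetical, while a real pipeline would embed each image and caption with an actual CLIP encoder before thresholding.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_by_clip_score(pairs, image_embs, text_embs, threshold=0.3):
    """Keep only pairs whose image/text embeddings align above threshold.

    Hypothetical helper: stands in for CLIP-score filtering, where each
    (image, caption) pair is scored by embedding similarity and low-scoring
    pairs are dropped from the pre-training set.
    """
    kept = []
    for pair, img, txt in zip(pairs, image_embs, text_embs):
        if cosine_sim(img, txt) >= threshold:
            kept.append(pair)
    return kept

# Toy demo with stand-in embeddings (a real pipeline would use CLIP features).
rng = np.random.default_rng(0)
pairs = [("img_0", "a cat"), ("img_1", "unrelated alt text")]
aligned = rng.normal(size=64)
image_embs = [aligned, rng.normal(size=64)]
# First caption embedding is close to its image; second is unrelated noise.
text_embs = [aligned + 0.05 * rng.normal(size=64), rng.normal(size=64)]
kept = filter_by_clip_score(pairs, image_embs, text_embs, threshold=0.5)
```

Note the trade-off the paper highlights: because such filters are trained mostly on Western, English-language data, the low-scoring pairs they discard disproportionately include long-tail, culturally diverse content.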