MetaCLIP 2: A Worldwide Scaling Recipe
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu
2025-07-31
Summary
This paper introduces MetaCLIP 2, a new version of CLIP, a popular AI model that learns to connect images and text, this time trained on a huge collection of image-text pairs gathered from all around the world in many languages.
What's the problem?
Previous models mostly learned from English data, so they struggled to understand or classify images paired with non-English text, performing worse in other languages and cultural contexts. Worse, earlier attempts to simply mix in non-English data often degraded English performance, a problem known as the "curse of multilinguality."
What's the solution?
MetaCLIP 2 solves this with a recipe for collecting and training on a balanced mix of image-text pairs from across the world, covering many languages and cultures. It extends CLIP's data-curation algorithm so that each language is treated fairly, and it scales up the model and training so it can learn effectively from this more diverse data; a minimal sketch of the per-language balancing idea follows.
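To make the balancing idea concrete, here is a minimal sketch. It assumes a list of `(image_url, alt_text, lang)` triples and a per-language concept list playing the role of the paper's metadata; the function and variable names are illustrative, not the authors' code, and the actual recipe subsamples head concepts probabilistically rather than with this simple greedy cap.

```python
import random
from collections import defaultdict

def curate(pairs, metadata, t_per_lang):
    """Keep a balanced subset of image-text pairs.

    For each language, cap how many kept pairs any single metadata
    concept can contribute (threshold t), so frequent "head" concepts
    get subsampled while rare "tail" concepts are kept in full.
    A per-language threshold keeps low-resource languages from being
    drowned out by high-resource ones.
    """
    counts = defaultdict(int)   # (lang, concept) -> pairs kept so far
    kept = []
    random.shuffle(pairs)       # so head concepts are subsampled uniformly
    for url, text, lang in pairs:
        t = t_per_lang[lang]
        # Substring-match the alt-text against this language's concepts.
        matched = [c for c in metadata[lang] if c in text]
        # Keep the pair if at least one matched concept is under its cap.
        if any(counts[(lang, c)] < t for c in matched):
            kept.append((url, text, lang))
            for c in matched:
                counts[(lang, c)] += 1
    return kept
```

The key design point this illustrates is that balancing happens per language: each language gets its own concept list and its own threshold, rather than filtering every caption through English-centric metadata.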
Why it matters?
This matters because it helps AI models understand images and text from around the globe, not just in English. That makes AI tools more inclusive, useful, and accurate for people everywhere, supporting applications such as image search, translation, and more.
Abstract
MetaCLIP 2 is trained from scratch on worldwide web-scale image-text pairs and improves both English zero-shot classification and multilingual benchmark performance, without system-level confounding factors such as translation or bespoke architecture changes.
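To make the zero-shot classification claim concrete, here is a minimal sketch using the standard open_clip API. The checkpoint tag "metaclip2_worldwide" is an assumption for illustration only (check the MetaCLIP repository for the actual released weights); the point is that class prompts can be written in any supported language.

```python
import torch
import open_clip
from PIL import Image

# Hypothetical pretrained tag; substitute the real released weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="metaclip2_worldwide")
tokenizer = open_clip.get_tokenizer("ViT-H-14")

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
# Candidate labels as captions, here in Spanish rather than English.
texts = tokenizer(["una foto de un gato", "una foto de un perro"])

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(texts)
    img_f /= img_f.norm(dim=-1, keepdim=True)
    txt_f /= txt_f.norm(dim=-1, keepdim=True)
    # Cosine similarity over normalized features -> class probabilities.
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)

print(probs)  # higher probability on the caption matching the image
```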