
Cluster and Predict Latent Patches for Improved Masked Image Modeling

Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, Piotr Bojanowski

2025-02-17


Summary

This paper introduces CAPI, a new way to teach AI to understand images by itself, without needing humans to label every picture. It's like giving the AI a puzzle where some pieces are missing, and it has to figure out what should go in those spaces.

What's the problem?

Current methods for teaching AI to understand images on its own (called Masked Image Modeling or MIM) aren't as good as methods that use human-labeled data. Scientists want to make MIM better so AI can learn more efficiently from unlabeled images.

What's the solution?

The researchers created CAPI, which works by grouping similar parts of images into clusters and then predicting which cluster belongs in the missing spaces. They carefully chose how to represent image parts, which loss function to use (the math that measures how well the AI is doing), and how to build the AI system itself. CAPI turned out to be really good at learning from images and was stable and easy to train.
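The paper describes the full training recipe; as a rough illustration of the core idea only (not the authors' actual implementation), here is a toy NumPy sketch in which cluster assignments of patch embeddings act as prediction targets for the masked positions. All names, sizes, and the random stand-in for the predictor's output are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_assign(patches, centroids):
    # Assign each patch embedding to its nearest centroid (a cluster id).
    dists = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def cross_entropy(logits, targets):
    # Softmax cross-entropy: how well the predicted cluster scores
    # match the true cluster ids of the masked patches.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy setup: 16 patch embeddings of dimension 8, grouped into 4 clusters.
patches = rng.normal(size=(16, 8))
centroids = rng.normal(size=(4, 8))

# "Teacher" side: turn each patch into a discrete cluster id.
targets = kmeans_assign(patches, centroids)

# Hide roughly half the patches; the model must predict their clusters.
mask = rng.random(16) < 0.5

# Stand-in for the predictor's output: one score per cluster, per masked patch.
pred_logits = rng.normal(size=(int(mask.sum()), 4))

# The training signal: loss is computed only on the masked positions.
loss = cross_entropy(pred_logits, targets[mask])
```

The key design choice this sketch tries to convey is that the target is a discrete cluster id rather than raw pixels, which the paper argues makes training stable.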

Why it matters?

This matters because it could make AI much better at understanding images without needing humans to spend time labeling everything. CAPI performed really well on tests for recognizing objects in images and understanding complex scenes, almost as well as the best current methods. This could lead to AI that learns more efficiently from the huge amount of unlabeled images available online, potentially improving things like image search, self-driving cars, and medical image analysis.

Abstract

Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state-of-the-art. In this paper, we systematically analyze target representations, loss functions, and architectures to introduce CAPI - a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state-of-the-art, DINOv2. We release all our code and models.