SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
2025-02-21
Summary
This paper introduces SigLIP 2, a new and improved version of an AI system that can understand both images and text in many languages. It's like a super-smart digital assistant that can see, read, and understand information from around the world better than ever before.
What's the problem?
The original SigLIP model was good, but it had limitations. It struggled to capture fine details in images, to pinpoint where things are in a picture, and to work equally well across many languages. It's like having a smart friend who can describe pictures but sometimes misses important details or gets confused by complex scenes.
What's the solution?
The researchers created SigLIP 2 by combining several previously separate techniques into one training recipe. They taught the model to write captions for images, to learn from its own predictions (self-distillation and masked prediction), and to train on a broader, automatically curated mix of data from different cultures and languages. They also released SigLIP 2 in several sizes, from smaller, faster models to bigger, more powerful ones, so it can fit different situations. It's like giving that smart friend special training to notice more details, understand context better, and learn from many more examples from around the world.
Why it matters?
This matters because it brings us closer to having AI that can truly understand and communicate about visual information in many languages. This could help break down language barriers, make technology more accessible worldwide, and improve things like image search, automatic captioning for visually impaired people, and even help robots understand their surroundings better. It's a big step towards making AI that can see and understand the world more like humans do, but across many cultures and languages.
Abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
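The "original image-text training objective" the abstract extends is SigLIP's pairwise sigmoid loss, which treats every image-text pair in a batch as an independent binary classification (match vs. non-match). A minimal NumPy sketch of that loss, not the authors' code; the function names and the fixed values used for the learnable temperature `t` and bias `b` are illustrative assumptions:

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss from the original SigLIP (which SigLIP 2 extends).

    img_emb, txt_emb: (n, d) L2-normalized embeddings; row i of each
    forms a matched image-text pair. t and b stand in for the learnable
    temperature and bias, fixed here purely for illustration.
    """
    n = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b      # (n, n) pairwise similarity scores
    labels = 2.0 * np.eye(n) - 1.0            # +1 for matched pairs, -1 otherwise
    # Each of the n^2 pairs is scored independently as match / non-match.
    return -np.sum(log_sigmoid(labels * logits)) / n
```

Because each pair contributes independently, the loss needs no softmax normalization across the batch; SigLIP 2 keeps this objective and layers the captioning, self-distillation, and masked-prediction losses mentioned above on top of it.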