
Adapting Vision-Language Models Without Labels: A Comprehensive Survey

Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi, Olga Fink

2025-08-11


Summary

This paper talks about different ways to make Vision-Language Models (VLMs), which understand both images and text, work better in new situations without needing any labeled data. It reviews many existing methods and groups them based on how much unlabeled image data is available.

What's the problem?

The problem is that while these models are very good at general tasks, they often don't perform at their best when applied to new, specific problems unless they are adapted. Adapting a model usually requires labeled data, which is expensive and hard to collect, so figuring out how to adapt without labels is the key challenge.

What's the solution?

The paper organizes and explains a variety of methods for unsupervised adaptation, which means improving the model using only unlabeled images. It categorizes approaches into four types depending on the data available: no data at all, lots of unlabeled data, small batches at test time, and continuous adaptation with streaming data. It discusses how these methods work, evaluates their performance, and suggests future research directions.
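To make the test-time setting from this taxonomy more concrete, here is a minimal sketch of label-free adaptation of a CLIP-style model on a small unlabeled test batch: it minimizes the entropy of the model's own image-text predictions, updating only the LayerNorm parameters of the image encoder. The interface (clip.load, clip.tokenize, encode_image, encode_text) follows OpenAI's open-source clip package; the class names, learning rate, and choice of adaptable parameters are illustrative assumptions, not the survey's prescribed method.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's open-source CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.float()  # full precision keeps the gradient updates numerically stable

# Task-specific class names; these are illustrative placeholders.
class_names = ["dog", "cat", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_feats = F.normalize(model.encode_text(prompts), dim=-1)

# Freeze everything, then unfreeze only the visual encoder's LayerNorm
# scale/shift parameters -- a small, commonly adapted subset of weights.
adapt_params = []
for module in model.visual.modules():
    if isinstance(module, torch.nn.LayerNorm):
        adapt_params += [module.weight, module.bias]
for p in model.parameters():
    p.requires_grad_(False)
for p in adapt_params:
    p.requires_grad_(True)

optimizer = torch.optim.Adam(adapt_params, lr=1e-4)

def adapt_and_predict(images, steps=1):
    """Adapt on one small unlabeled test batch, then return class predictions.

    images: preprocessed image tensor of shape (batch, 3, 224, 224).
    """
    for _ in range(steps):
        image_feats = F.normalize(model.encode_image(images), dim=-1)
        logits = 100.0 * image_feats @ text_feats.T  # CLIP-style cosine similarity
        log_probs = logits.log_softmax(dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()  # no labels needed: the loss is the model's own uncertainty
        optimizer.step()
    with torch.no_grad():
        image_feats = F.normalize(model.encode_image(images), dim=-1)
        return (image_feats @ text_feats.T).argmax(dim=-1)
```

The other settings in the survey's taxonomy differ mainly in what data a loop like this sees, whether no images at all, a large unlabeled dataset, or a continuous stream, rather than in the idea of learning from the model's own unlabeled predictions.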

Why it matters?

This matters because adapting Vision-Language Models without labeled data makes it easier and cheaper to use them in real-world tasks, where labeled examples may not be available. This can help bring powerful AI tools to more applications and environments faster and with less effort.

Abstract

A comprehensive survey of unsupervised adaptation methods for Vision-Language Models (VLMs) categorizes approaches based on the availability of unlabeled visual data and discusses methodologies, benchmarks, and future research directions.