Singular Value Few-shot Adaptation of Vision-Language Models

Taha Koleilat, Hassan Rivaz, Yiming Xiao

2025-09-09

Summary

This paper introduces a new method called CLIP-SVD for adapting powerful vision-language models, such as CLIP, to specific tasks without completely retraining them. Full retraining is expensive and can sometimes make the model forget what it already learned.

What's the problem?

Vision-language models are great at many things, but making them really good at *specific* tasks, especially fine-grained ones with lots of detail, is hard. Usually, you need to carefully craft prompts or bolt extra components onto the model, but these methods don't always work well and can even disrupt the model's existing knowledge. Fully retraining the model is costly and time-consuming.

What's the solution?

CLIP-SVD solves this by making tiny, targeted changes to the model's internal settings using a mathematical technique called Singular Value Decomposition (SVD). Instead of adding new components, it subtly adjusts the existing ones, only changing about 0.04% of the model's total parameters. This allows the model to learn the new task without losing its general abilities.
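The core idea can be sketched in a few lines: decompose a pretrained weight matrix with SVD, freeze the singular vectors, and treat only the singular values as trainable. This is a minimal illustrative sketch, not the authors' implementation; the matrix, its size, and the `delta` parameter are placeholders for CLIP's actual weight matrices and learned updates.

```python
import numpy as np

# Stand-in for a pretrained CLIP weight matrix (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))

# Decompose: W = U @ diag(S) @ Vt. U and Vt stay frozen.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

def adapted_weight(delta):
    # Only the singular values are rescaled; the basis vectors
    # (U, Vt) learned during pretraining are left untouched.
    return U @ np.diag(S + delta) @ Vt

# With delta = 0 the pretrained weights are recovered exactly,
# so adaptation starts from the original model.
assert np.allclose(adapted_weight(np.zeros_like(S)), W)

# Only len(S) values are trained instead of all of W:
print(S.size, "trainable vs.", W.size, "frozen parameters")
```

Because each matrix contributes only as many trainable values as it has singular values, the update touches a vanishingly small fraction of the model, which is how the paper reaches its reported 0.04% figure.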

Why it matters?

This research is important because it provides a way to efficiently adapt these large models to new areas, like medical imaging, without needing huge amounts of computing power or risking damage to the model's overall performance. It achieves better results than previous methods and offers a way to understand *how* the model is adapting, making it more trustworthy and useful.

Abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.