Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Konstantinos M. Dafnis, Dimitris N. Metaxas

2025-11-18

Summary

This paper introduces a new method called Spectrum-Aware Test-Time Steering, or STS, which helps vision-language models maintain their accuracy when faced with images that are different from the ones they were originally trained on.

What's the problem?

Vision-language models are really good at understanding images and text together, even when asked to do things they haven't specifically been trained for. However, their performance drops when they encounter images that look different from their training data, such as a change in lighting or style. Existing methods to fix this problem often require a lot of computational power and memory because they need to adjust the core parts of the model, making them slow and resource-intensive.

What's the solution?

STS tackles this problem by making small, quick adjustments to how the model *interprets* the image and text, rather than changing the model itself. It identifies the most important directions in the text embeddings' meaning and then subtly shifts the image's representation along those directions. This 'steering' happens in a hidden (latent) space within the model, without retraining any of the original components or backpropagating through the frozen encoders. It only adapts a small number of shift parameters for each image, making it very efficient.
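To make the steering idea concrete, here is a minimal sketch of the core mechanism as described above: extract principal semantic directions from the text embeddings via SVD, then shift an image feature along those directions using a handful of per-sample parameters. All names (`steer`, `shifts`, `U`) and the dimensions are illustrative assumptions, not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, k = 10, 64, 4                      # classes, embedding dim, top-k directions

# Unit-normalized text embeddings for C class prompts (stand-ins for CLIP features).
text_emb = rng.normal(size=(C, d))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Principal semantic directions: top-k right singular vectors of the
# mean-centered text embedding matrix.
centered = text_emb - text_emb.mean(axis=0, keepdims=True)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
U = Vt[:k]                               # (k, d) spectral subspace basis

def steer(image_feat, shifts):
    """Shift the latent image feature along the k spectral directions,
    then re-normalize. Only `shifts` (k numbers) is adapted per sample."""
    steered = image_feat + shifts @ U    # (d,)
    return steered / np.linalg.norm(steered)

image_feat = rng.normal(size=(d,))
image_feat /= np.linalg.norm(image_feat)

shifts = np.zeros(k)                     # per-sample parameters, start at zero
out = steer(image_feat, shifts)
print(out.shape)                         # prints (64,)
```

With zero shifts the feature is unchanged, so adaptation starts from the model's original zero-shot prediction and only moves as far as the k shift parameters allow.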

Why it matters?

This research is important because it provides a way to make vision-language models more reliable in real-world situations where images can vary greatly. STS is faster and uses much less memory than other methods, meaning it can be used on devices with limited resources and can process images more quickly. This opens the door to more practical applications of these powerful models.

Abstract

Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
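The abstract's objective, minimizing entropy across augmented views by adapting only a few per-sample shift parameters, can be sketched as follows. This is an illustrative toy version under stated assumptions (random features in place of encoder outputs, finite-difference gradients in place of the paper's optimizer); the variable names and the step-acceptance rule are mine, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(1)
V, C, d, k = 8, 10, 64, 4                # augmented views, classes, dim, directions

# Stand-in text embeddings and their top-k spectral subspace.
text_emb = rng.normal(size=(C, d))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
U = np.linalg.svd(text_emb - text_emb.mean(0), full_matrices=False)[2][:k]

# Stand-in latent features of V augmented views of one test image.
views = rng.normal(size=(V, d))
views /= np.linalg.norm(views, axis=1, keepdims=True)

def entropy(shifts, tau=0.07):
    """Entropy of the view-averaged class distribution after steering."""
    z = views + shifts @ U               # steer every view with shared shifts
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ text_emb.T / tau
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p_avg = p.mean(axis=0)               # marginal prediction over views
    return -(p_avg * np.log(p_avg + 1e-12)).sum()

# Adapt only the k shift parameters; encoders stay frozen throughout.
shifts, lr, eps = np.zeros(k), 0.5, 1e-4
for _ in range(30):
    g = np.array([(entropy(shifts + eps * np.eye(k)[i])
                   - entropy(shifts - eps * np.eye(k)[i])) / (2 * eps)
                  for i in range(k)])
    cand = shifts - lr * g
    if entropy(cand) < entropy(shifts):  # accept only improving steps
        shifts = cand
    else:
        lr *= 0.5                        # back off on overshoot

print(entropy(shifts) <= entropy(np.zeros(k)))  # prints True
```

Because only k scalars are optimized and no gradients flow through the encoders, the per-image cost is tiny, which is the source of the speed and memory advantages the abstract reports.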