
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang

2025-10-29


Summary

This paper introduces a new method, VL-SAE, to understand how vision-language models (VLMs) connect images and text. These models are good at reasoning about both types of information, but it's been hard to figure out *what* specifically makes those connections work.

What's the problem?

Current VLMs are really powerful, but they're like a 'black box'. We can see what they *do*, but not *how* they do it. Specifically, it's difficult to pinpoint what concepts or ideas the model is using when it links a picture to a description. There isn't a clear way to translate the model's internal workings into something humans can easily understand, like specific objects or actions.

What's the solution?

The researchers created VL-SAE, a special type of neural network called a sparse autoencoder. Think of it like a translator: it takes the representations from both images and text and organizes them into a set of 'neurons'. Each neuron corresponds to a specific concept the model has learned, like 'cat' or 'running'. The key is that semantically similar images and texts activate the *same* neurons, letting us see which concepts the model associates with each input. During self-supervised training, the network is encouraged to make the same neurons 'fire' whenever two images or texts have similar meanings.
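The idea can be sketched in a few lines of numpy. This is only an illustration of the mechanism described above, not the authors' implementation: the dimensions, the top-k sparsity rule, and the random concept/decoder weights are all hypothetical stand-ins for what VL-SAE learns during training.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, TOP_K = 16, 32, 4  # embedding dim, concept neurons, active neurons kept

# Hypothetical shared concept directions; in VL-SAE these are learned.
concepts = rng.normal(size=(K, D))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# Two modality-specific decoders, one per embedding space (illustrative).
W_dec_img = rng.normal(size=(K, D))
W_dec_txt = rng.normal(size=(K, D))

def encode(x):
    """Distance-based encoder sketch: activate only the TOP_K concepts
    closest to x (by cosine similarity), zeroing the rest for sparsity."""
    x = x / np.linalg.norm(x)
    sims = concepts @ x                  # similarity to each concept neuron
    acts = np.zeros(K)
    top = np.argsort(sims)[-TOP_K:]      # keep only the top-k concepts
    acts[top] = np.maximum(sims[top], 0.0)
    return acts

def decode(acts, modality):
    """Reconstruct an embedding with the decoder for the given modality."""
    return acts @ (W_dec_img if modality == "image" else W_dec_txt)

# A semantically similar image/text pair should fire the same neurons.
img_emb = rng.normal(size=D)
txt_emb = img_emb + 0.01 * rng.normal(size=D)  # nearly aligned pair
a_img, a_txt = encode(img_emb), encode(txt_emb)
shared = sorted(set(np.flatnonzero(a_img)) & set(np.flatnonzero(a_txt)))
print("shared concept neurons:", shared)
```

Because both modalities are encoded against the same concept set, the overlap in active neurons is exactly the kind of human-readable signal the paper uses to interpret alignment: inspecting which concepts fire for an image and its caption shows what the model thinks they have in common.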

Why does it matter?

This work is important because it helps us understand and improve VLMs. By figuring out which concepts these models are using, we can check whether their reasoning makes sense and identify potential biases. And by strengthening the connections between images and text at the concept level, the model becomes more accurate at tasks like identifying objects in images and avoiding 'hallucinations', where it makes up details that aren't actually there.

Abstract

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Codes are available at https://github.com/ssfgunner/VL-SAE.