Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
2025-11-05
Summary
This paper investigates what happens to an AI model's understanding of images and language when it is also taught to perform actions. These models, called Vision-Language-Action (VLA) models, are built on top of models that are already skilled at understanding images and language, and the research asks whether training them to act erodes those original abilities.
What's the problem?
The core issue is that when you take a model already skilled at understanding both vision (images) and language and then train it to *do* things (actions), it seems to forget some of what it originally learned about images and language. This 'forgetting' is a problem because the initial understanding of images and language is what gives the model its general knowledge and ability to adapt to new situations. The researchers wanted to figure out *how much* information is lost during this action training and *why* it happens.
What's the solution?
The researchers carefully examined the inner workings of these VLA models during training. They looked at how the model focuses on different parts of images and how it represents information internally. They also created specific tests to directly compare the VLA models to their original image-and-language counterparts, isolating the changes caused by learning actions. Finally, they tested different techniques to prevent the loss of image and language understanding while still teaching the model to perform actions, ultimately finding a simple method that worked well.
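This summary does not spell out the exact metric used to compare the VLA's internal representations with those of its original VLM. One common way to quantify this kind of representation drift is linear centered kernel alignment (CKA) between hidden states extracted from the two models on the same inputs. The sketch below is illustrative, not the paper's actual procedure; the function and variable names (`linear_cka`, `feats_vlm`, `feats_vla`) are assumptions, and the features are stand-in random arrays rather than real model activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices.

    X, Y: (n_samples, dim) hidden states extracted from the same inputs
    by two models (e.g., the original VLM and the fine-tuned VLA).
    Returns a similarity in [0, 1]; values well below 1 indicate that
    fine-tuning has shifted the representations.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

# Toy check: identical features give CKA = 1; a perturbed copy scores lower.
rng = np.random.default_rng(0)
feats_vlm = rng.standard_normal((256, 64))
feats_vla = feats_vlm + 0.5 * rng.standard_normal((256, 64))  # drifted copy
print(round(linear_cka(feats_vlm, feats_vlm), 3))  # 1.0
print(linear_cka(feats_vlm, feats_vla) < 1.0)      # True
```

In practice one would compute this per layer, on hidden states pulled from matched forward passes of the VLM and the VLA, to see where along the network the divergence appears.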
Why it matters?
This research is important because it highlights a trade-off in building more capable AI agents. You can’t just add action capabilities without considering the impact on existing knowledge. Understanding this trade-off and finding ways to preserve the original understanding of images and language is crucial for creating AI that can truly generalize and perform well in the real world, even in situations it hasn’t specifically been trained for.
Abstract
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
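The abstract describes the proposed fix only as a "simple yet effective" way of aligning visual representations, without giving its exact form here. One plausible instance of such alignment is a distillation-style penalty that keeps the fine-tuned VLA's visual features close to those of the frozen pretrained VLM while the action objective is optimized. The sketch below is an assumption, not the paper's method; the function names, the cosine-distance form, and the weight `lam` are all illustrative choices.

```python
import numpy as np

def alignment_loss(feats_vla, feats_vlm_frozen):
    """Mean cosine distance between the fine-tuned model's visual
    features and the frozen pretrained VLM's features on the same
    inputs. Zero when the representations match exactly."""
    a = feats_vla / np.linalg.norm(feats_vla, axis=-1, keepdims=True)
    b = feats_vlm_frozen / np.linalg.norm(feats_vlm_frozen, axis=-1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(a * b, axis=-1)))

def total_loss(action_loss, feats_vla, feats_vlm_frozen, lam=0.1):
    # Action-prediction objective plus a representation-alignment
    # penalty that discourages drift away from the pretrained VLM.
    return action_loss + lam * alignment_loss(feats_vla, feats_vlm_frozen)

# Toy check: identical features add no penalty.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 32))
print(round(total_loss(0.5, feats, feats), 6))  # 0.5
```

The weight `lam` would trade off action performance against retention of the inherited VL representations; tuning it is part of the trade-off the abstract describes.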