Mechanistic interpretability for steering vision-language-action models
Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin
2025-09-09
Summary
This paper explores how to understand and control Vision-Language-Action (VLA) models: AI systems that take visual input and language instructions and produce actions to perform tasks in the real world. The research focuses on making these models more transparent and easier to direct.
What's the problem?
Currently, VLA models are like 'black boxes'. We know they can *do* things, but it's hard to figure out *why* they do them. Traditional robotics relies on detailed understanding of how robots move and interact with the world, but VLAs lack this clear internal logic. This makes it risky to use them in real-world situations where safety and predictability are important, because we can't easily fix problems or explain their behavior.
What's the solution?
The researchers developed a method to peek inside VLA models and identify specific internal directions that control particular aspects of behavior, such as movement speed or direction. These controls turn out to be surprisingly simple and can be adjusted directly, without retraining the model or giving it new examples. This 'steering' method lets them change the robot's behavior in real time, just by tweaking these internal activations.
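The core idea of "tweaking internal settings" can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: the activation vector, the semantic direction, and the steering strength are all made-up examples of the kind of quantities the method operates on.

```python
# Illustrative sketch of activation steering (not the paper's actual code).
# A semantic direction (e.g. one correlated with movement speed) is assumed
# to have been found beforehand via the interpretability analysis.

def steer_activation(hidden, direction, alpha):
    """Shift a hidden activation along a semantic direction.

    hidden:    the model's internal activation at some layer (list of floats)
    direction: a semantic direction, e.g. a hypothetical 'speed' axis
    alpha:     steering strength; positive pushes behavior toward the concept
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy example with a 4-dimensional activation:
hidden = [0.5, -1.0, 0.2, 0.0]
speed_dir = [1.0, 0.0, 0.0, 0.0]  # hypothetical 'speed' direction
steered = steer_activation(hidden, speed_dir, alpha=2.0)
# Only the component along the direction changes; the rest pass through.
```

Because the intervention is just vector addition at inference time, no gradients, rewards, or extra rollouts are needed, which is what makes the steering "zero-shot".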
Why does it matter?
This work is a big step towards creating more reliable and understandable robots. By making VLAs more transparent, we can build robots that are safer, more predictable, and easier to adapt to new tasks. It opens the door to a new way of building robotic systems – using powerful AI models as a foundation, but with the ability to directly control and understand their actions.
Abstract
Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis, identifying sparse semantic directions - such as speed and direction - that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control - establishing a new paradigm for transparent and steerable foundation models in robotics.
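The abstract's step of "projecting feedforward activations onto the token embedding basis" is a logit-lens-style readout: score each vocabulary token by its inner product with the activation. Below is a minimal sketch of that operation; the vocabulary, embedding matrix, and activation values are invented for illustration and are not from the paper.

```python
# Minimal sketch of projecting an activation onto the token embedding
# basis (a "logit lens"-style readout). All names and numbers here are
# illustrative assumptions, not the paper's actual data.

def project_onto_embeddings(activation, embedding_matrix, vocab):
    """Score each vocabulary token by its dot product with the activation."""
    scores = {}
    for token, embedding in zip(vocab, embedding_matrix):
        scores[token] = sum(a * e for a, e in zip(activation, embedding))
    return scores

vocab = ["fast", "slow", "left", "right"]  # toy vocabulary
embeddings = [
    [1.0, 0.0, 0.0],   # "fast"
    [-1.0, 0.0, 0.0],  # "slow"
    [0.0, 1.0, 0.0],   # "left"
    [0.0, -1.0, 0.0],  # "right"
]
activation = [0.8, 0.1, 0.3]  # hypothetical feedforward activation
scores = project_onto_embeddings(activation, embeddings, vocab)
top_token = max(scores, key=scores.get)
```

Tokens whose embeddings align with the activation receive high scores, which is how sparse, human-readable directions such as "speed" can be surfaced and then reused as steering vectors.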