Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
2025-02-12
Summary
This paper presents a new way to understand and control AI systems that process images and videos. The researchers use a tool called sparse autoencoders (SAEs) that reveals what these AI models are actually learning and makes it possible to change their behavior in specific, targeted ways.
What's the problem?
Current methods for understanding AI vision models have two main issues. Some methods can show us what the AI is looking at, but we can't test if these features actually cause the AI's decisions. Other methods let us change the AI's behavior, but we don't really understand what we're changing. It's like having a car where you can either see the engine or control it, but not both at the same time.
What's the solution?
The researchers developed a framework using sparse autoencoders that addresses both problems. SAEs can identify specific visual features that the AI is using, like recognizing eyes or wheels, in a way that humans can understand. At the same time, they allow researchers to precisely change these features and observe how those changes affect the AI's behavior. The researchers tested this on different types of AI vision models and found that it works well across various tasks without needing to retrain the underlying model.
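The core mechanism can be illustrated with a toy sketch: an SAE encodes a model's activation vector into a larger, sparse set of latent features (ReLU keeps most of them at zero), decodes it back, and an intervention simply rescales one latent feature before decoding. This is a minimal illustration with randomly initialized weights standing in for a trained SAE; the dimensions, function names, and the single-unit `intervene` helper are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 8, 32  # model activation dim; SAE latent dim (overcomplete)

# Random weights stand in for a trained SAE (training would minimize
# reconstruction error plus a sparsity penalty on the latents).
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps latent features non-negative and sparse
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def intervene(x, unit, scale):
    # Rescale one latent feature, then decode back to the model's
    # activation space; scale=0 ablates the feature, scale>1 amplifies it.
    f = encode(x)
    f[..., unit] *= scale
    return decode(f)

x = rng.normal(size=(1, d_model))   # a stand-in model activation
x_hat = decode(encode(x))           # plain reconstruction
x_edit = intervene(x, unit=3, scale=0.0)  # reconstruction with feature 3 ablated
```

The edited activation `x_edit` would then be fed back into the vision model in place of the original, letting one test whether the ablated feature causally drives a given behavior.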
Why does it matter?
This matters because as AI becomes more common in our lives, we need to understand how it makes decisions, especially for important tasks like medical diagnosis or self-driving cars. This new method gives us a powerful tool to both understand and control AI vision systems, which could help make them safer, more reliable, and more trustworthy. It's a step towards creating AI that we can truly understand and adjust when needed, rather than just accepting its decisions without knowing why they were made.
Abstract
To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical usage of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos and models on our project website: https://osu-nlp-group.github.io/SAE-V.