Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina
2026-02-18
Summary
This paper investigates whether Sparse Autoencoders (SAEs) actually work as intended: finding meaningful, human-understandable features inside complex neural networks.
What's the problem?
SAEs are becoming popular for trying to understand what's happening inside 'black box' AI models, but a growing number of studies report that they don't improve performance on downstream tasks designed to test whether they've actually learned something useful. The core question is: do SAEs really identify the important parts of a neural network, or are they just finding patterns that *look* good but don't actually reflect how the network functions?
What's the solution?
The researchers tackled this in two ways. First, they built a synthetic setup where the true underlying features were known in advance, and then trained SAEs to see whether they could recover them. The SAEs recovered only about 9% of the true features, even though they reconstructed the original activations well (71% explained variance). Second, they built 'fake' SAE baselines whose feature directions or activation patterns were fixed to random values rather than learned. Surprisingly, these random baselines matched fully-trained SAEs on tests of interpretability, sparse probing, and causal editing, suggesting that SAEs aren't doing much better than random baselines; a toy version of the ground-truth recovery check is sketched below.
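To make the ground-truth recovery idea concrete, here is a minimal, illustrative sketch (not the paper's exact protocol): synthetic activations are built as sparse combinations of known feature directions, and recovery is measured by how many of those directions are matched by some decoder direction with high cosine similarity. The dimensions, sparsity level, and the 0.9 similarity threshold are assumptions made for illustration.

```python
# Illustrative sketch only (not the paper's exact protocol). Synthetic
# activations are sparse combinations of known ground-truth directions;
# recovery asks how many of those directions an SAE decoder finds.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_true, n_samples = 64, 256, 10_000

# Ground-truth dictionary: random unit directions (an assumption for illustration).
true_dirs = rng.normal(size=(n_true, d_model))
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)

# Each sample activates a small random subset of features with positive weights.
coeffs = np.zeros((n_samples, n_true))
for i in range(n_samples):
    active = rng.choice(n_true, size=5, replace=False)
    coeffs[i, active] = rng.random(5)
activations = coeffs @ true_dirs  # what an SAE would be trained to reconstruct

def recovery_rate(decoder_dirs, true_dirs, threshold=0.9):
    """Fraction of true directions matched by some decoder direction with
    cosine similarity >= threshold (the threshold itself is an assumption)."""
    dec = decoder_dirs / np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    cos = np.abs(dec @ true_dirs.T)  # (n_learned, n_true) cosine similarities
    return float((cos.max(axis=0) >= threshold).mean())

# With a trained SAE, `sae_decoder` would be its decoder weight matrix; random
# directions are plugged in here only to show the interface.
sae_decoder = rng.normal(size=(2 * n_true, d_model))
print(f"recovered fraction of true features: {recovery_rate(sae_decoder, true_dirs):.1%}")
```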
Why it matters?
This research is important because it challenges the idea that SAEs are a reliable tool for understanding AI. If SAEs aren't actually revealing meaningful features, then we need to be careful about trusting their results and potentially look for other methods to interpret these complex models. It suggests that simply getting good reconstruction accuracy doesn't guarantee that an SAE has discovered something truly important about how a neural network works.
Abstract
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
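The abstract describes three baselines that constrain SAE feature directions or their activation patterns to random values. As a rough illustration of what such a baseline could look like (an assumption, not necessarily the paper's construction), here is a sketch of an SAE whose decoder directions are frozen at random unit vectors while the remaining parameters are trained with a standard reconstruction-plus-sparsity objective.

```python
# Hypothetical baseline sketch (an assumption, not necessarily the paper's exact
# construction): an SAE whose decoder directions are frozen random unit vectors,
# so only the encoder and bias terms are trained.
import torch
import torch.nn as nn

class RandomDecoderSAE(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        w = torch.randn(n_features, d_model)
        w = w / w.norm(dim=1, keepdim=True)
        self.register_buffer("decoder_dirs", w)  # fixed, never trained
        self.decoder_bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        feats = torch.relu(self.encoder(x - self.decoder_bias))  # sparse codes
        recon = feats @ self.decoder_dirs + self.decoder_bias
        return recon, feats

# Standard SAE-style objective: reconstruction error plus an L1 sparsity penalty.
# Dimensions, learning rate, and penalty weight are placeholders.
sae = RandomDecoderSAE(d_model=768, n_features=16_384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(32, 768)  # stand-in for model activations
recon, feats = sae(x)
loss = (recon - x).pow(2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
opt.step()
```

If a baseline of this kind matches fully-trained SAEs on interpretability, probing, and editing benchmarks, then those benchmarks are not distinguishing learned feature directions from random ones, which is the paper's central concern.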