Mechanistic Permutability: Match Features Across Layers
Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
2024-10-14

Summary
This paper introduces SAE Match, a new method for aligning features extracted from different layers of deep neural networks to better understand how these features evolve.
What's the problem?
In deep learning, understanding how features (the important patterns that models learn) change as they move through the layers of a neural network is challenging. Existing methods struggle because individual units can respond to several unrelated concepts at once (polysemanticity) and more features can be packed into a layer than it has dimensions, so features overlap in complex ways (feature superposition). This makes it hard to compare and align features across layers.
What's the solution?
SAE Match addresses this problem by using Sparse Autoencoders (SAEs) to extract features from each layer and then aligning those features across layers. The method matches features by minimizing the mean squared error between the "folded" SAE parameters of adjacent layers: the activation thresholds are folded into the encoder and decoder weights, so that differences in feature scale do not distort the comparison. This allows for a clearer picture of how features develop and change as they pass through the network.
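The matching step described above can be sketched as an assignment problem: compare every feature direction in one layer against every feature direction in the next, and pick the one-to-one pairing with the lowest total squared error. The sketch below is illustrative, not the authors' code; it assumes decoder weight matrices with one column per feature and uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) to solve the assignment, which is one reasonable way to minimize the summed MSE.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_features(dec_a, dec_b):
    """Match layer-B features to layer-A features by squared error.

    dec_a, dec_b: decoder weight matrices of shape (d_model, n_features),
    one column per feature (shapes are an assumption for this sketch).
    Returns perm such that dec_b[:, perm[i]] is the assigned match for
    dec_a[:, i] under minimal total squared error.
    """
    # Pairwise squared distances between every feature in A and in B.
    diff = dec_a[:, :, None] - dec_b[:, None, :]   # (d_model, n_a, n_b)
    cost = np.sum(diff ** 2, axis=0)               # (n_a, n_b) cost matrix
    # Optimal one-to-one assignment minimizing the summed cost.
    _, perm = linear_sum_assignment(cost)
    return perm
```

For example, if `dec_b` is just a column-shuffled copy of `dec_a`, the returned permutation recovers the shuffle exactly, since each feature's only zero-cost match is its own shuffled copy.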
Why it matters?
This research is important because it advances our understanding of how deep neural networks operate. By providing a way to align and interpret features across layers, SAE Match can help researchers and developers improve model designs and enhance mechanistic interpretability, making AI systems more transparent and reliable.
Abstract
Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
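The "parameter folding" mentioned in the abstract can be illustrated with a small sketch. Assuming an SAE with a JumpReLU activation (as used for Gemma 2 SAEs), each feature has a threshold that also sets its scale; dividing the encoder parameters by the threshold and multiplying the corresponding decoder column by it leaves the encode-decode map unchanged while normalizing every threshold to 1. The shapes and function names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def jumprelu(z, theta):
    # JumpReLU: pass the pre-activation through only where it exceeds
    # its (positive) per-feature threshold; zero elsewhere.
    return z * (z > theta)


def fold_thresholds(w_enc, b_enc, w_dec, theta):
    """Fold per-feature JumpReLU thresholds into the SAE weights.

    Rescales each feature so its threshold becomes 1 while the overall
    encode-decode map is unchanged; the folded decoder columns then carry
    each feature's scale, so MSE comparisons across layers are not skewed
    by scale differences. Assumed shapes: w_enc (n_feat, d_model),
    b_enc (n_feat,), w_dec (d_model, n_feat), theta (n_feat,).
    """
    w_enc_f = w_enc / theta[:, None]   # encoder rows divided by threshold
    b_enc_f = b_enc / theta            # encoder bias divided likewise
    w_dec_f = w_dec * theta[None, :]   # decoder columns absorb the scale
    return w_enc_f, b_enc_f, w_dec_f
```

The folding is function-preserving because `jumprelu(z / theta, 1)` fires in exactly the same cases as `jumprelu(z, theta)`, and the decoder's multiplication by `theta` undoes the division.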