Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
2025-02-07
Summary
This paper presents a method for understanding how large language models process information through their layers. It tracks and maps how specific features, such as concepts or patterns, change as they move through the model, and shows how these features can be adjusted to control the model's output.
What's the problem?
Large language models are very powerful but are often seen as 'black boxes' because it's hard to understand how they process information internally. This lack of understanding makes it difficult to interpret their decisions or guide their behavior in a transparent way.
What's the solution?
The researchers developed a method that uses sparse autoencoders and cosine similarity to build detailed maps of how features evolve across the layers of a language model. These maps, called flow graphs, show whether features persist, transform, or disappear at each stage. By analyzing these graphs, they were able to amplify or suppress specific features and steer the model's output in chosen directions, such as focusing on particular themes during text generation.
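To make the cross-layer matching concrete, here is a minimal sketch of the data-free idea: compare the decoder directions of two sparse autoencoders trained on consecutive layers using cosine similarity, and keep the strongest matches as flow-graph edges. The function name, shapes, and threshold are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def match_features(dec_a, dec_b, threshold=0.5):
    """Match SAE features between consecutive layers by cosine similarity.

    dec_a, dec_b: decoder weight matrices of shape (n_features, d_model),
    where each row is a feature's direction in the residual stream.
    Returns (i, j, sim) edges where feature i in layer A best matches
    feature j in layer B with similarity above `threshold`; features with
    no match above the threshold are treated as disappearing (or, from
    the next layer's side, newly appearing).
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                      # pairwise cosine similarities
    best = sims.argmax(axis=1)          # best next-layer match per feature
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]

# Toy example: 3 features in layer A, 2 in layer B (d_model = 4).
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(3, 4))
dec_b = np.vstack([dec_a[0] * 2.0,       # same direction, rescaled: persists
                   rng.normal(size=4)])  # unrelated direction: new feature
edges = match_features(dec_a, dec_b)
```

Because only the decoder weights are compared, no input data or forward passes are needed, which is what makes the technique "data-free."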
Why it matters?
This research is important because it helps make AI models more understandable and controllable. By revealing how information flows through the model and allowing targeted adjustments, this method could lead to more transparent and reliable AI systems, which is crucial for building trust and ensuring ethical use.
Abstract
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
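The steering step the abstract describes can be sketched as adding (or subtracting) a scaled copy of a feature's decoder direction to a layer's residual-stream activation. This is a hedged illustration of the general idea, not the paper's exact procedure; the function name and the choice of alpha are assumptions.

```python
import numpy as np

def steer(hidden, feature_dir, alpha):
    """Shift a residual-stream activation along one SAE feature direction.

    hidden: activation vector at some layer, shape (d_model,).
    feature_dir: the chosen feature's decoder direction (normalized here).
    alpha > 0 amplifies the feature; alpha < 0 suppresses it.
    """
    d = feature_dir / np.linalg.norm(feature_dir)
    return hidden + alpha * d

# Toy check: steering increases the activation's projection onto the feature.
rng = np.random.default_rng(1)
hidden = rng.normal(size=8)
direction = rng.normal(size=8)
unit = direction / np.linalg.norm(direction)
before = hidden @ unit
after = steer(hidden, direction, alpha=4.0) @ unit
```

In a real model this edit would be applied during the forward pass (e.g. via an activation hook) at the layer where the flow graph shows the feature is active, so the change propagates to the generated text.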