Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
2024-08-12

Summary
This paper introduces Gemma Scope, an open-source suite of sparse autoencoders designed to help researchers understand and interpret the inner workings of the Gemma 2 language models.
What's the problem?
Understanding how complex AI models like Gemma 2 make decisions is challenging. Sparse autoencoders are a promising interpretability tool, but training them is expensive, so prior work outside industry has mostly covered smaller models or single layers, limiting our view of how larger systems work end to end. This lack of insight hinders efforts to improve AI safety and reliability, especially for issues like biases or errors in a model's responses.
What's the solution?
Gemma Scope provides a comprehensive suite of sparse autoencoders trained on every layer and sub-layer of the Gemma 2 2B and 9B models, plus select layers of the 27B model. These autoencoders act like a 'microscope' that lets researchers see how features within the model evolve and interact as information flows through it. The authors also release the weights and tools needed for others to use this suite, promoting further research in AI interpretability.
Why it matters?
This research is important because it opens up new avenues for understanding complex AI systems, making it easier for researchers to identify and address potential issues. By providing tools for better interpretability, Gemma Scope aims to enhance the safety and effectiveness of AI technologies, which is crucial as these systems become more integrated into everyday life.
Abstract
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope
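To make the abstract's key term concrete: a JumpReLU SAE encodes a model activation into a much wider, sparse feature vector and then reconstructs the activation from it, where the encoder nonlinearity zeroes out any feature that does not clear a learned per-feature threshold. The sketch below is a minimal illustration of that forward pass, not Gemma Scope's actual code; all weights, names, and sizes are made up for the example.

```python
import numpy as np

def jumprelu_sae(x, W_enc, b_enc, W_dec, b_dec, theta):
    """Illustrative JumpReLU SAE forward pass (toy sketch, not the released code).

    x:     (d,) activation vector from the language model
    W_enc: (d, m) encoder weights, with m >> d (overcomplete dictionary)
    theta: (m,) learned per-feature thresholds
    """
    pre = x @ W_enc + b_enc
    # JumpReLU: keep a feature only if its pre-activation clears its threshold
    feats = np.where(pre > theta, pre, 0.0)
    recon = feats @ W_dec + b_dec
    return feats, recon

# Toy example with random weights (illustration only)
rng = np.random.default_rng(0)
d, m = 8, 32  # hidden size and dictionary size are arbitrary here
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)
theta = np.full(m, 0.5)

feats, recon = jumprelu_sae(rng.normal(size=d), W_enc, b_enc, W_dec, b_dec, theta)
print("active features:", int((feats > 0).sum()), "of", m)
```

The thresholding is what makes the decomposition sparse: most entries of `feats` are exactly zero, and the few that fire are the candidate "interpretable features" the paper evaluates.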