Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda
2025-12-15
Summary
This paper introduces a new way to analyze large amounts of text data, focusing on understanding what's *in* the data and how models react to it, especially when looking for potential problems like biases.
What's the problem?
Currently, analyzing huge collections of text is difficult and expensive. Existing methods either use powerful but costly AI models (like large language models) to understand the text, or they create simplified representations that lose important details and don't allow for focused investigation of specific topics. It's hard to efficiently find differences between datasets or understand what concepts are linked together in a meaningful way.
What's the solution?
The researchers propose using sparse autoencoders (SAEs) to create 'SAE embeddings': text representations in which each dimension corresponds to a clear, understandable concept. They tested these embeddings on four data analysis tasks, showing they are cheaper and more reliable than using large language models, and more focused and controllable than dense embedding methods. They used SAE embeddings to compare how different AI models respond to questions, uncover hidden biases, and group documents by specific themes.
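To make the idea concrete, here is a minimal sketch of the standard SAE encoding step that turns a dense embedding into a sparse, concept-aligned one. The shapes, the random weights, and the negative bias (which stands in for the learned threshold that keeps activations sparse) are illustrative stand-ins, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 768-d dense embedding mapped to 16384 SAE features.
d_model, d_sae = 768, 16384

# In practice W_enc and b_enc come from a trained sparse autoencoder;
# random weights and a negative bias are stand-ins to show the computation.
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
b_enc = np.full(d_sae, -1.0)

def sae_embed(dense_vec: np.ndarray) -> np.ndarray:
    """Encode a dense embedding into sparse features via a ReLU encoder
    (the typical SAE encoding step)."""
    return np.maximum(dense_vec @ W_enc + b_enc, 0.0)

dense = rng.normal(size=d_model)   # stand-in for a document embedding
sparse = sae_embed(dense)

# Only a small fraction of features fire; in a trained SAE, each active
# index corresponds to one interpretable concept.
active = np.flatnonzero(sparse)
```

The payoff of this representation is that each coordinate can be read off individually, which is what makes the comparisons and filtering described above possible.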
Why does it matter?
This work is important because it provides a more practical and insightful way to analyze the massive datasets used to train AI models. By understanding the data better, we can identify and address issues like biases and unexpected behaviors, ultimately leading to more trustworthy and reliable AI systems. It highlights that looking *at* the data itself is crucial for understanding how AI models work.
Abstract
Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies ambiguities more often than nine other frontier models. Relative to LLMs, SAE embeddings uncover bigger differences at 2-8x lower cost and identify biases more reliably. Additionally, SAE embeddings are controllable: by filtering concepts, we can (3) cluster documents along axes of interest and (4) outperform dense embeddings on property-based retrieval. Using SAE embeddings, we study model behavior with two case studies: investigating how OpenAI model behavior has changed over time and finding "trigger" phrases learned by Tulu-3 (Lambert et al., 2024) from its training data. These results position SAEs as a versatile tool for unstructured data analysis and highlight the neglected importance of interpreting models through their data.
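The "filtering concepts" step from the abstract can be sketched as restricting similarity computations to a chosen subset of SAE feature dimensions. The function name and toy vectors below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def filtered_similarity(query_sae: np.ndarray, doc_sae: np.ndarray,
                        concept_idxs) -> float:
    """Cosine similarity restricted to selected SAE feature dimensions,
    so retrieval only 'sees' the concepts of interest."""
    q = query_sae[concept_idxs]
    d = doc_sae[concept_idxs]
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom else 0.0

# Toy 4-feature SAE vectors; indices 0 and 2 stand in for the concepts
# we care about (e.g. topic features), and the rest are ignored.
query = np.array([1.0, 0.0, 2.0, 0.0])
doc = np.array([1.0, 0.0, 0.0, 3.0])
sim = filtered_similarity(query, doc, [0, 2])
```

Because the scoring ignores the unselected dimensions, documents cluster or rank along the chosen axis of interest rather than overall semantic similarity, which is the controllability the abstract contrasts with dense embeddings.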