OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
2025-10-06
Summary
This paper introduces a new way to build sparse autoencoders, tools used in AI interpretability to break a neural network's internal activations down into simpler, more understandable parts. The goal is to produce features that humans can easily interpret.
What's the problem?
Regular sparse autoencoders often learn features that are not truly independent. 'Feature absorption' happens when a specialized feature captures instances of a more general one, leaving holes in the general feature's coverage. 'Feature composition' is the opposite problem: features that should stay separate blend into a single composite representation, making it hard to pinpoint what each one stands for. Both issues make it harder to understand what the model has actually learned.
What's the solution?
The researchers developed the Orthogonal Sparse Autoencoder, or OrtSAE. This method pushes the learned features to be nearly orthogonal, so each one captures a distinct aspect of the data. It does this by penalizing pairs of features whose directions become too similar during training. Importantly, the penalty scales linearly with the autoencoder's size, so training does not slow down significantly as the model grows.
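The idea of penalizing similar feature pairs can be illustrated with a small sketch. This is a hypothetical, simplified version of such a penalty, not the paper's exact implementation: it treats the decoder columns as feature directions, computes all pairwise cosine similarities, and penalizes only those whose magnitude exceeds a margin (the function name, `margin` parameter, and squared penalty are illustrative assumptions).

```python
import numpy as np

def orthogonality_penalty(decoder, margin=0.1):
    """Toy penalty on pairwise cosine similarity between SAE features.

    decoder: (d_model, n_features) array whose columns are feature directions.
    Pairs with |cosine similarity| below `margin` are left unpenalized.
    """
    # Normalize each feature direction to unit length.
    W = decoder / np.maximum(np.linalg.norm(decoder, axis=0, keepdims=True), 1e-8)
    sims = W.T @ W                    # all pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)      # ignore self-similarity (always 1)
    excess = np.maximum(np.abs(sims) - margin, 0.0)
    return (excess ** 2).sum() / 2   # each unordered pair counted once
```

Adding such a term to the reconstruction loss pushes the optimizer toward feature directions that point in genuinely different directions, which is the disentanglement effect the method is after.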
Why it matters?
OrtSAE finds about 9% more truly distinct features than standard methods. It substantially reduces feature absorption (by 65%) and feature composition (by 15%), meaning the features are more focused and independent. This translates into better performance on removing spurious correlations from data (+6%), while matching other methods on typical downstream tasks. Ultimately, this makes AI models more transparent and reliable.
Abstract
Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features, creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed at mitigating these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.
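The abstract notes that the penalty scales linearly with SAE size. One common way to achieve that (a hypothetical sketch; the paper's exact scheme may differ) is to compute pairwise similarities only within random chunks of features at each step, so the cost per step is O(n × chunk_size) rather than O(n²). The function name and parameters below are illustrative assumptions.

```python
import numpy as np

def chunked_orthogonality_penalty(decoder, chunk_size=512, margin=0.1, rng=None):
    """Orthogonality penalty restricted to random disjoint feature chunks.

    decoder: (d_model, n_features) array whose columns are feature directions.
    Only pairs that land in the same chunk are penalized this step, keeping
    the per-step cost linear in the number of features.
    """
    rng = np.random.default_rng(rng)
    n = decoder.shape[1]
    perm = rng.permutation(n)        # random feature-to-chunk assignment
    total = 0.0
    for start in range(0, n, chunk_size):
        cols = perm[start:start + chunk_size]
        W = decoder[:, cols]
        W = W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-8)
        sims = W.T @ W               # similarities within this chunk only
        np.fill_diagonal(sims, 0.0)
        excess = np.maximum(np.abs(sims) - margin, 0.0)
        total += (excess ** 2).sum() / 2
    return total
```

Because chunk membership is re-randomized every training step, every feature pair is eventually penalized in expectation, while no single step ever pays the full quadratic cost.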