FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies
Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Eduardo Rodrigues Vieira, Andrew Bermingham, Ziad El Sayed
2025-06-24
Summary
This paper introduces FaithfulSAE, a method that improves sparse autoencoders (tools that decompose a model's internal representations into simpler, interpretable features) by training them on data generated by the model itself rather than on an external dataset.
What's the problem?
Sparse autoencoders can be unstable across training runs and may learn fake features that do not reflect the model's actual internal computation, especially when they are trained on external datasets that are out-of-distribution for the model.
What's the solution?
The researchers generate a synthetic training dataset by sampling from the model itself, so the sparse autoencoder is trained only on activations the model naturally produces. This stabilizes training, reduces fake features, and avoids the problems caused by training data that differs from what the model sees during use.
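The core idea above (train a sparse autoencoder on activations the model produces for its own generated data, rather than on an external corpus) can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: the "model" is a hypothetical random ReLU projection standing in for a language model's hidden layer, and the SAE is a single-layer encoder/decoder with an L1 sparsity penalty, trained by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a language model: a fixed random ReLU
# projection whose outputs play the role of hidden-layer activations.
d_in, d_model, d_sae = 8, 16, 64
W_model = rng.normal(size=(d_in, d_model))

def model_activations(n):
    """Self-generated data: activations come from the model's own
    sampled inputs, so no external dataset is involved."""
    x = rng.normal(size=(n, d_in))
    return np.maximum(x @ W_model, 0.0)

# Sparse autoencoder: linear encoder/decoder with ReLU codes
# and an L1 penalty encouraging sparse feature activations.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

acts = model_activations(2048)
lr, l1 = 1e-2, 1e-3

def recon_loss():
    z = np.maximum(acts @ W_enc + b_enc, 0.0)
    return float(np.mean((z @ W_dec - acts) ** 2))

initial_loss = recon_loss()
for step in range(300):
    batch = acts[rng.integers(0, len(acts), 64)]
    z = np.maximum(batch @ W_enc + b_enc, 0.0)        # sparse codes
    err = z @ W_dec - batch                           # reconstruction error
    # Gradient of (reconstruction + L1) w.r.t. codes, masked by ReLU
    grad_z = (err @ W_dec.T + l1 * np.sign(z)) * (z > 0)
    W_dec -= lr * (z.T @ err) / len(batch)
    W_enc -= lr * (batch.T @ grad_z) / len(batch)
    b_enc -= lr * grad_z.mean(axis=0)
final_loss = recon_loss()
```

Because the training distribution here is exactly what the model produces, there is no train/deployment mismatch by construction; the paper's contribution is making this self-generated setup work for real language models.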
Why does it matter?
This matters because it makes sparse autoencoders more reliable and interpretable, which helps us better understand and trust how machine learning models work, especially when they break down complex information into understandable parts.
Abstract
FaithfulSAE improves sparse autoencoder stability and interpretability by training on synthetic datasets generated by the model itself, reducing fake features and the issues caused by out-of-distribution training data.