Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang
2025-06-18
Summary
This paper introduces Group Bias Adaptation, a new training method that improves sparse autoencoders, tools used to extract clear, interpretable features from the internal activations of large language models. The improved autoencoders disentangle mixed (polysemantic) representations into single, understandable (monosemantic) parts.
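For readers unfamiliar with sparse autoencoders, the sketch below shows the basic idea: an encoder maps a model activation to a much wider, mostly-zero feature vector, and a decoder reconstructs the activation from those few active features. This is a generic illustration, not the paper's exact architecture; the dimensions, the ReLU-plus-bias encoder, and the reconstruction loss are assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Generic sparse autoencoder sketch (illustrative, not the paper's exact design)."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Encoder maps an LLM activation (d_model) to a wider feature space (n_features).
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        # Decoder reconstructs the activation from the sparse feature vector.
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # ReLU keeps only features whose pre-activation exceeds the learned bias,
        # so most entries of f are zero; each active entry should act as one feature.
        f = F.relu(x @ self.W_enc + self.b_enc)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

# Usage: reconstruct a batch of hidden activations and measure reconstruction error.
sae = SparseAutoencoder(d_model=512, n_features=4096)
x = torch.randn(8, 512)              # stand-in for LLM residual-stream activations
x_hat, features = sae(x)
loss = F.mse_loss(x_hat, x)          # training would also encourage sparsity of `features`
```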
What's the problem?
Inside large language models, individual neurons and other components are often polysemantic: they respond to many unrelated concepts at once. This mixing of signals makes it hard to tell what any single part represents, which limits our ability to interpret how these models work.
What's the solution?
The researchers developed a statistical framework for how features are generated, along with a training algorithm, Group Bias Adaptation, that helps sparse autoencoders recover individual, interpretable features from the entangled internal representations of language models. They provide theoretical guarantees that the method recovers the true underlying features and show empirically that it outperforms previous techniques.
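The algorithm's name suggests that training adapts the bias terms of groups of autoencoder neurons; the precise procedure and its guarantees are given in the paper. As a rough, hypothetical illustration of that idea only, the sketch below nudges each group's biases so that its neurons fire at a target frequency; the grouping, target frequencies, and update rule here are assumptions, not the authors' algorithm.

```python
import torch

def adapt_group_biases(b_enc: torch.Tensor,
                       activations: torch.Tensor,
                       group_ids: torch.Tensor,
                       target_freq: torch.Tensor,
                       step: float = 0.01) -> torch.Tensor:
    """Hypothetical group-wise bias-adaptation step (illustration only).

    b_enc:       (n_features,) encoder biases
    activations: (batch, n_features) feature activations from one batch
    group_ids:   (n_features,) integer group index of each neuron
    target_freq: (n_groups,) desired activation frequency per group
    """
    # Fraction of inputs on which each neuron fired in this batch.
    fired = (activations > 0).float().mean(dim=0)            # (n_features,)
    new_b = b_enc.clone()
    for g in range(int(group_ids.max().item()) + 1):
        mask = group_ids == g
        avg_freq = fired[mask].mean()
        # If the group's neurons fire more often than the target, lower their biases
        # (making activation harder); if they fire too rarely, raise the biases.
        new_b[mask] += step * (target_freq[g] - avg_freq)
    return new_b
```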
Why it matters?
Making the internals of AI models understandable helps researchers verify, debug, and improve them, and it is a step toward AI systems that are more transparent and easier to control.
Abstract
A new statistical framework and training algorithm, Group Bias Adaptation, enhance Sparse Autoencoders for recovering monosemantic features in Large Language Models, offering theoretical guarantees and superior performance.