
Monet: Mixture of Monosemantic Experts for Transformers

Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang

2024-12-06

Summary

This paper introduces Monet, a new architecture for large language models (LLMs) that makes the knowledge inside these models easier to interpret and control, helping them avoid harmful outputs without sacrificing performance.

What's the problem?

Individual neurons in large language models often respond to multiple unrelated concepts, a problem known as polysemanticity. This makes it hard to interpret what the model is doing internally, which in turn makes undesirable behaviors, like generating toxic content, difficult to diagnose and prevent. Existing fixes such as Sparse Autoencoders try to disentangle these concepts after training, but doing so has hurt the models' performance.

What's the solution?

The authors introduce Monet, which uses a method called 'Mixture of Monosemantic Experts.' This approach gives the model a very large number of specialized experts that each focus on a single concept, making it easier to see what the model is doing. Instead of adding interpretability after the fact, Monet builds sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining, and its expert decomposition scales to 262,144 experts per layer while the total parameter count grows only with the square root of the number of experts. This structure lets knowledge be inspected and edited across different domains and languages, and harmful outputs reduced, without degrading general performance.
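To make the square-root scaling concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a product-style expert decomposition: two small banks of sub-experts are stored, and every pairing of one "horizontal" and one "vertical" sub-expert acts as a virtual expert, so the addressable expert count is the square of the stored sub-expert count (e.g. 512 × 512 = 262,144) while parameters grow only with the sub-expert count. All class, parameter, and variable names here are illustrative assumptions, not identifiers from the Monet codebase.

```python
# Hedged sketch of product-decomposed experts; names are illustrative,
# not taken from the Monet repository.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductExpertLayer(nn.Module):
    def __init__(self, d_model=512, d_expert=16, n_sub=64, top_k=8):
        super().__init__()
        # n_sub "horizontal" and n_sub "vertical" sub-experts are stored;
        # their Cartesian product addresses n_sub**2 virtual experts
        # (512 x 512 = 262,144 in the paper; 64 here to keep the demo small)
        # without ever materializing them, so parameters scale with n_sub.
        self.n_sub, self.top_k = n_sub, top_k
        self.router_h = nn.Linear(d_model, n_sub)  # scores over horizontal sub-experts
        self.router_v = nn.Linear(d_model, n_sub)  # scores over vertical sub-experts
        self.up = nn.Parameter(torch.randn(n_sub, d_model, d_expert) * 0.02)
        self.down = nn.Parameter(torch.randn(n_sub, d_expert, d_model) * 0.02)

    def forward(self, x):                          # x: (batch, d_model)
        gh = F.softmax(self.router_h(x), dim=-1)   # (batch, n_sub)
        gv = F.softmax(self.router_v(x), dim=-1)   # (batch, n_sub)
        # Keep only the top-k sub-experts on each axis, so at most
        # top_k * top_k virtual experts are active for a given token.
        gh = gh * torch.zeros_like(gh).scatter(1, gh.topk(self.top_k, dim=-1).indices, 1.0)
        gv = gv * torch.zeros_like(gv).scatter(1, gv.topk(self.top_k, dim=-1).indices, 1.0)
        # Virtual expert (i, j) applies up-projection i then down-projection j,
        # gated by the product gh[:, i] * gv[:, j].
        hidden = torch.einsum("bd,ide,bi->be", x, self.up, gh)       # (batch, d_expert)
        return torch.einsum("be,jem,bj->bm", hidden, self.down, gv)  # (batch, d_model)

layer = ProductExpertLayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

The design point this sketch illustrates is that the two sub-expert banks (`up` and `down`) each hold n_sub weight matrices, so doubling the sub-expert count quadruples the number of addressable experts while only doubling the parameters, which is the square-root relationship the paper exploits.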

Why it matters?

This research is important because it enhances the transparency of large language models, helping developers and users understand how these models work and ensuring they align better with human values. By improving interpretability and control over model behavior, Monet could lead to safer and more effective AI applications in various fields.

Abstract

Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.