Extracting Interaction-Aware Monosemantic Concepts in Recommender Systems
Dor Arviv, Yehonatan Elisha, Oren Barkan, Noam Koenigstein
2025-11-25
Summary
This paper introduces a new technique for understanding what different parts of a recommender system 'think' about when suggesting items to users, by identifying specific concepts each part represents.
What's the problem?
Recommender systems, like those used by Netflix or Amazon, are often 'black boxes' – we know they work, but it's hard to understand *why* they make certain recommendations. While researchers have made progress in understanding language models, applying those techniques to recommender systems is tricky because recommender systems need to consider both users *and* items and how they relate to each other. Simply finding concepts in users or items alone isn't enough; the concepts need to explain the connections between them.
What's the solution?
The researchers used a type of neural network called a Sparse Autoencoder to analyze the user and item embeddings (the learned numerical representations) inside an existing recommender system. Importantly, they designed a new way to train this network so that the concepts it learns stay consistent with the recommender's own predictions. This 'prediction-aware training' ensures the concepts explain *why* the system recommends what it does. The result is a set of 'neurons' within the network, each representing a single concept, such as a specific genre, how popular an item is, or when it was released.
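The idea above can be sketched in a few lines. This is a minimal, illustrative numpy version, not the authors' implementation: the dot-product recommender, the random weights, and all variable names are assumptions. It shows the shape of the objective, a standard sparse-autoencoder loss (reconstruction plus an L1 sparsity penalty) combined with a prediction-alignment term that compares the frozen recommender's scores before and after reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 64  # embedding dim, SAE latent dim (overcomplete)

# Hypothetical pretrained (frozen) user/item embeddings.
user_emb = rng.normal(size=(8, d))
item_emb = rng.normal(size=(8, d))

# SAE parameters (these would be learned; random here for illustration).
W_enc = rng.normal(size=(d, k)) * 0.1
W_dec = rng.normal(size=(k, d)) * 0.1

def sae(x):
    """Encode with ReLU (encourages sparse, non-negative 'neurons'),
    then decode back to the original embedding space."""
    z = np.maximum(x @ W_enc, 0.0)   # sparse latent codes
    x_hat = z @ W_dec                # reconstruction
    return z, x_hat

z_u, u_hat = sae(user_emb)
z_i, i_hat = sae(item_emb)

def affinity(u, i):
    # Assumed frozen recommender: user-item score as a dot product.
    return np.sum(u * i, axis=-1)

# Standard SAE terms: reconstruction error + L1 sparsity penalty.
recon = np.mean((user_emb - u_hat) ** 2) + np.mean((item_emb - i_hat) ** 2)
sparsity = np.mean(np.abs(z_u)) + np.mean(np.abs(z_i))

# Prediction-aware term: scores computed from the reconstructed
# embeddings should match the recommender's original predictions.
pred_align = np.mean((affinity(u_hat, i_hat)
                      - affinity(user_emb, item_emb)) ** 2)

loss = recon + 0.01 * sparsity + pred_align
```

In training, this loss would be minimized over `W_enc` and `W_dec` while the recommender itself stays frozen; the prediction-alignment term is what ties each latent neuron to the model's actual user-item behavior rather than to the embeddings in isolation.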
Why it matters?
This work is important because it provides a way to make recommender systems more transparent and controllable. By understanding the concepts the system uses, we can potentially filter recommendations based on those concepts, promote certain types of content, or generally improve personalization without having to completely rebuild the recommender system itself. It’s a practical tool for making these systems more interpretable and useful.
Abstract
We present a method for extracting monosemantic neurons, defined as latent dimensions that align with coherent and interpretable concepts, from user and item embeddings in recommender systems. Our approach employs a Sparse Autoencoder (SAE) to reveal semantic structure within pretrained representations. In contrast to work on language models, monosemanticity in recommendation must preserve the interactions between separate user and item embeddings. To achieve this, we introduce a prediction aware training objective that backpropagates through a frozen recommender and aligns the learned latent structure with the model's user-item affinity predictions. The resulting neurons capture properties such as genre, popularity, and temporal trends, and support post hoc control operations including targeted filtering and content promotion without modifying the base model. Our method generalizes across different recommendation models and datasets, providing a practical tool for interpretable and controllable personalization. Code and evaluation resources are available at https://github.com/DeltaLabTLV/Monosemanticity4Rec.
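The post hoc control operations mentioned in the abstract (targeted filtering and content promotion) amount to editing a concept neuron in the sparse latent space and decoding back, leaving the base model untouched. A minimal numpy sketch, with the neuron index and all weights chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 16, 64  # embedding dim, SAE latent dim

# Hypothetical SAE weights and item embeddings (random stand-ins).
W_enc = rng.normal(size=(d, k)) * 0.1
W_dec = rng.normal(size=(k, d)) * 0.1
item_emb = rng.normal(size=(5, d))

# Encode items into sparse concept activations.
z = np.maximum(item_emb @ W_enc, 0.0)

# Suppose neuron 7 was identified as encoding an unwanted concept
# (hypothetical index chosen for the example).
concept = 7
z_edit = z.copy()
z_edit[:, concept] = 0.0        # targeted filtering: suppress the concept
# z_edit[:, concept] *= 2.0     # or amplify it to promote such content

# Decode back to embedding space; the recommender then scores these
# edited embeddings while its own parameters remain frozen.
item_edit = z_edit @ W_dec
```

Because the intervention happens entirely in the SAE's latent space, the same mechanism supports both removal and promotion of a concept without retraining the recommender.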