Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda
2024-11-22

Summary
This paper explores the problem of 'hallucinations' in large language models (LLMs), which occur when these models generate incorrect or nonsensical information, and investigates how these models recognize entities they can recall facts about.
What's the problem?
Hallucinations in LLMs are a common issue where the model produces outputs that seem plausible but are actually false or ungrounded. Understanding why these hallucinations happen is crucial: without that understanding, our ability to improve the models and prevent incorrect responses is limited. The mechanisms behind this problem, particularly how models recognize entities they know about, are not well understood.
What's the solution?
The authors use a tool called sparse autoencoders to analyze how LLMs recognize entities. They discover that the model has internal representations that help it determine whether it knows facts about a specific entity, like a person or a movie. By identifying causally relevant directions in the model's representation space, they show that steering along these directions can make the model refuse to answer questions about entities it knows, or hallucinate attributes for entities it doesn't recognize. This suggests that LLMs have a form of self-knowledge: internal representations of what they do and do not know.
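To make the steering idea concrete, here is a minimal, self-contained sketch (not the authors' code) of this kind of intervention: given a unit-norm SAE decoder direction for an "unknown entity" latent, a scaled copy of that direction is added to the residual-stream activations at the entity token, and the model's downstream behavior (e.g. refusal) is then observed. The tensor shapes, the scale alpha, and the steer_residual helper are illustrative assumptions, not the paper's implementation.

```python
import torch

def steer_residual(resid: torch.Tensor, direction: torch.Tensor,
                   alpha: float, positions: list[int]) -> torch.Tensor:
    """Add alpha * direction to the residual stream at the given token positions.

    resid:     [batch, seq_len, d_model] residual-stream activations (hypothetical shapes)
    direction: [d_model] unit-norm SAE decoder direction for the latent being steered
    alpha:     steering strength; sign and magnitude are tuned empirically
    """
    steered = resid.clone()
    steered[:, positions, :] += alpha * direction
    return steered

# Toy example with made-up sizes: batch=1, seq_len=8, d_model=16.
resid = torch.randn(1, 8, 16)
direction = torch.randn(16)
direction = direction / direction.norm()   # unit-norm, as SAE decoder rows typically are
steered = steer_residual(resid, direction, alpha=5.0, positions=[3])  # steer at the entity token
```

In practice such an edit would be applied inside the model's forward pass (e.g. via a hook at a chosen layer) rather than to a standalone tensor; the arithmetic shown is the core of the intervention.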
Why it matters?
This research is significant because it provides a deeper understanding of how LLMs work and why they sometimes produce hallucinated information. By deepening our understanding of how these models recognize entities, we can develop better strategies to reduce hallucinations and enhance the reliability of AI systems, making them safer and more accurate for users.
Abstract
Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space; these detect whether the model recognizes an entity, e.g. detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: capable of steering the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
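For readers who want to see what "detecting whether the model recognizes an entity" looks like operationally, here is a minimal sketch of reading an SAE latent from a residual-stream activation. It assumes a generic ReLU sparse autoencoder parameterization (encoder weights, encoder bias, decoder bias); the shapes, the ReLU formulation, and the latent index are illustrative assumptions rather than details taken from the paper.

```python
import torch

def sae_latent_activations(x, W_enc, b_enc, b_dec):
    """Encode a residual-stream vector x with a generic ReLU sparse autoencoder.

    x:     [d_model] activation at the final token of the entity name
    W_enc: [d_model, d_sae] encoder weights
    b_enc: [d_sae] encoder bias
    b_dec: [d_model] decoder bias (subtracted before encoding in many SAE variants)
    Returns the [d_sae] vector of latent activations.
    """
    return torch.relu((x - b_dec) @ W_enc + b_enc)

# Toy sizes: d_model=16, d_sae=64 (illustrative only).
d_model, d_sae = 16, 64
x = torch.randn(d_model)
W_enc, b_enc, b_dec = torch.randn(d_model, d_sae), torch.zeros(d_sae), torch.zeros(d_model)

latents = sae_latent_activations(x, W_enc, b_enc, b_dec)
unknown_entity_latent = 42                      # hypothetical index of an "unknown entity" latent
is_unrecognized = latents[unknown_entity_latent] > 0
```

Checking whether the "unknown entity" (or "known entity") latent fires on the entity token is the detection step; the steering sketch earlier shows how the corresponding decoder direction can then be used as a causal intervention.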