From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers
Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok
2025-09-10
Summary
This research investigates why powerful AI systems, specifically those built using 'transformer' models, sometimes confidently state things that aren't true – a phenomenon called 'hallucination'. It aims to understand *how* and *when* these hallucinations occur, and what's happening inside the AI's 'brain' when they do.
What's the problem?
As AI gets better and is used in important areas like science and business, we need to be able to trust it. A major problem is that these AI systems can 'hallucinate,' meaning they generate incorrect or nonsensical information while sounding very certain. This lack of reliability hinders their widespread adoption, especially when mistakes could have serious consequences. We don't fully understand *why* these models hallucinate, making it hard to fix the issue.
What's the solution?
The researchers used tools called 'sparse autoencoders' to peek inside transformer models and see how they represent concepts. They systematically tested the AI with increasingly unclear or random inputs and found that, when the input is messy or uncertain, the AI starts relying on internal concepts that aren't actually related to the input but still seem meaningful to the model. Essentially, the AI fills in the gaps with its own ideas, leading to hallucinations. They could even predict when a hallucination would happen just by looking at the AI's internal activity.
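To make the approach concrete, here is a minimal Python sketch of the kind of measurement involved: run a prompt through a transformer, encode an intermediate layer's activations with a sparse autoencoder, and count how many features fire. The choice of model (GPT-2 small), the probe layer, and the randomly initialized autoencoder standing in for a trained one are all illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch (not the authors' code): count how many sparse-autoencoder
# features fire on a transformer's hidden states for a clean prompt vs. a
# noisy one. GPT-2 small, layer 6, and the random SAE weights below are
# stand-in assumptions that only illustrate the mechanics of the measurement.
import torch
from transformers import GPT2Model, GPT2Tokenizer

torch.manual_seed(0)
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

D_MODEL, D_SAE = 768, 8192                    # hidden size and overcomplete SAE width
W_enc = torch.randn(D_MODEL, D_SAE) * 0.02    # stand-in for trained SAE encoder weights
b_enc = torch.zeros(D_SAE)

def active_features(text: str, layer: int = 6, threshold: float = 0.5) -> int:
    """Encode one layer's hidden states with the SAE and count features that fire."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer]        # shape [1, seq_len, d_model]
        feats = torch.relu(hidden @ W_enc + b_enc)        # SAE encoder: ReLU(x W + b)
    # A feature counts as "used" if it exceeds the threshold at any token position.
    return int((feats.max(dim=1).values > threshold).sum())

clean = "The Eiffel Tower is located in Paris, France."
noisy = tok.decode(torch.randint(0, 50257, (12,)).tolist())  # random-token "prompt"

print("clean prompt :", active_features(clean), "features active")
print("random tokens:", active_features(noisy), "features active")
```

With a trained sparse autoencoder and controlled noise levels, the paper's reported pattern is that the number of active, seemingly meaningful features grows as the input becomes less structured.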
Why it matters?
Understanding why AI hallucinates is crucial for making these systems safer and more reliable. This research provides insights that can help us align AI with human values, protect against malicious attacks that exploit these weaknesses, and even automatically assess how likely an AI is to generate false information. It's a step towards building AI we can truly trust.
Abstract
As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes is now an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights into transformer internal processing mechanics has immediate consequences for aligning AI models with human values and for AI safety, exposes a potential attack surface for adversarial attacks, and provides a basis for automatically quantifying a model's hallucination risk.
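As a rough illustration of the last claim, that hallucinations can be predicted from concept patterns in layer activations, the sketch below fits a linear probe on per-example feature-activation vectors. The data here are synthetic placeholders, and the probe type, dimensions, and labeling scheme are assumptions chosen for illustration, not the paper's pipeline.

```python
# Minimal sketch (assumptions, not the paper's pipeline): given per-example
# SAE feature-activation vectors from an intermediate transformer layer,
# fit a linear probe that predicts whether the generated output was
# hallucinated. The feature matrix and labels below are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_examples, n_features = 2000, 512            # illustrative sizes only

# Stand-in data: in practice X would hold SAE activations pooled over the
# response tokens, and y would come from hallucination judgments.
X = rng.exponential(scale=1.0, size=(n_examples, n_features))
y = rng.integers(0, 2, size=n_examples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000, C=0.1).fit(X_tr, y_tr)

scores = probe.predict_proba(X_te)[:, 1]      # per-example hallucination risk score
print("held-out ROC AUC:", round(roc_auc_score(y_te, scores), 3))
```

With real feature activations and labels, a high held-out score of this kind would indicate that hallucination risk is decodable from the concept patterns in a model's intermediate layers.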