
Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona

2025-10-08


Summary

This research dives into why large language models, like the ones powering chatbots, sometimes confidently state things that aren't true – a problem called 'hallucination'. It doesn't just point out the problem, but tries to figure out *where* in the model's internal workings these errors originate and *how* they happen.

What's the problem?

Large language models are really good at sounding convincing, even when they're completely making things up. This is a major issue because it makes it hard to trust the information they provide. The core problem is understanding *why* these models hallucinate: is it a flaw in the data they were trained on, or something about the way the model itself is built?

What's the solution?

The researchers developed a new method called 'Distributional Semantics Tracing' (DST) to map out the model's internal reasoning process. This allowed them to pinpoint a specific layer, which they call the commitment layer, where the errors become unavoidable – a point of no return from factual accuracy. They also found that the model relies on two different 'thinking' styles: a fast, intuitive, associative one and a slower, more careful, context-driven one. Hallucinations happen when the fast, intuitive style takes over and leads the model down the wrong path, overriding the more reliable contextual style.
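The paper's own DST implementation is not reproduced here, but a minimal sketch of the general idea is to inspect what each layer "believes" by projecting its hidden states through the output head (the so-called logit lens) and watching where the top prediction locks onto a wrong answer. The model choice (gpt2), the prompt, and the use of the logit lens as a stand-in for the paper's tracing framework are all illustrative assumptions.

```python
# Minimal sketch (not the authors' DST code): layer-by-layer logit-lens tracing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Australia is"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, hidden_dim]
for layer_idx, hidden in enumerate(outputs.hidden_states):
    # Project the last token's state through the final norm and unembedding.
    last_token_state = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(last_token_state)
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d}: top prediction = {top_token!r}")

# If the top prediction flips to an incorrect continuation at some layer and
# never recovers in later layers, that layer plays the role of what the paper
# calls a "commitment layer".
```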

Why it matters?

Understanding the root cause of hallucinations is crucial for building more reliable AI. By identifying the specific layer and the internal conflict that causes these errors, researchers can start to develop ways to fix the problem and make these models more trustworthy. This work provides a detailed, mechanistic explanation of *how* and *why* hallucinations occur, which is a big step towards creating AI we can depend on.

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret using the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation (rho = -0.863) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
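The reported rho is a rank correlation between a pathway-coherence score and hallucination rates. As a rough illustration of how such a statistic is computed, the sketch below uses Spearman's rho via SciPy; the two arrays are hypothetical placeholders, not the paper's data, and "coherence" stands in for whatever score DST assigns to the contextual pathway.

```python
# Minimal sketch: rank correlation between a coherence score and hallucination rate.
from scipy.stats import spearmanr

coherence_scores   = [0.91, 0.84, 0.77, 0.69, 0.58, 0.45, 0.33]  # hypothetical values
hallucination_rate = [0.05, 0.09, 0.14, 0.22, 0.31, 0.44, 0.58]  # hypothetical values

rho, p_value = spearmanr(coherence_scores, hallucination_rate)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
# A strongly negative rho would mirror the paper's finding that weaker
# contextual-pathway coherence predicts higher hallucination rates.
```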