Large Language Models Do NOT Really Know What They Don't Know
Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
2025-10-17
Summary
This paper investigates whether large language models (LLMs) actually 'know' when they are making things up, or 'hallucinating' facts. It challenges the idea that LLMs have a clear internal sense of truthfulness.
What's the problem?
LLMs are known to sometimes generate incorrect information, even though they are very good at predicting text. The question is whether their internal computations let them distinguish correct information they have learned from fabricated information. The difficulty is that the same processes that help them get things right can also lead to errors, so it is unclear whether we can trust their internal workings to tell the difference between fact and fiction.
What's the solution?
Researchers looked closely at *how* LLMs process questions when they answer correctly versus when they hallucinate. They focused on two kinds of made-up answers: those that still draw on the model's knowledge of the subject, and those detached from that knowledge entirely. They found that when the model hallucinates something tied to its existing subject knowledge, its internal processing looks almost identical to when it answers correctly. However, when the hallucination is detached from anything the model knows about the subject, the internal processing is distinct and more easily spotted. This was done by examining the patterns of hidden-state activity within the model.
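The geometric intuition can be illustrated with a toy sketch (not the paper's actual code): we simulate hidden-state vectors for the three response types, with correct answers and knowledge-associated hallucinations drawn from the same distribution (same recall process) and knowledge-detached hallucinations drawn from a shifted one. A simple nearest-centroid probe then separates only the detached class. All dimensions, sample counts, and distributions here are hypothetical.

```python
# Toy illustration of overlapping vs. separable hidden-state geometries.
# The distributions below are assumptions for demonstration, not measurements.
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 500  # hypothetical hidden-state dimension and samples per class

# Correct answers and knowledge-associated hallucinations share the same
# recall process, so we model them with the same distribution.
correct = rng.normal(0.0, 1.0, size=(n, d))
associated = rng.normal(0.0, 1.0, size=(n, d))
# Knowledge-detached hallucinations occupy a distinct region of the space.
detached = rng.normal(3.0, 1.0, size=(n, d))

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two classes."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def nearest_centroid_accuracy(a, b):
    """Accuracy of a nearest-centroid probe separating classes a and b."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    X = np.vstack([a, b])
    y = np.array([0] * len(a) + [1] * len(b))
    pred = (np.linalg.norm(X - cb, axis=1)
            < np.linalg.norm(X - ca, axis=1)).astype(int)
    return float((pred == y).mean())

# Overlapping geometries: the probe is near chance for associated hallucinations.
print("dist correct vs associated:", centroid_distance(correct, associated))
print("dist correct vs detached:  ", centroid_distance(correct, detached))
print("probe acc, associated:", nearest_centroid_accuracy(correct, associated))
print("probe acc, detached:  ", nearest_centroid_accuracy(correct, detached))
```

The probe separates correct answers from detached hallucinations almost perfectly but stays near chance on the associated ones, mirroring the paper's claim that only hallucinations detached from subject knowledge are detectable from internal states.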
Why it matters?
This research shows that LLMs don't actually encode 'truth' internally. They are really good at recalling and combining patterns of information, but they don't have a separate mechanism for knowing whether that information is actually true. This means LLMs don't 'know what they don't know' in the way a human does, and it highlights a fundamental limitation in how these models currently work, impacting our trust in their outputs.
Abstract
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may "know what they don't know". However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that "LLMs don't really know what they don't know".