Representational Stability of Truth in Large Language Models

Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad

2025-11-25

Summary

This paper investigates how consistently large language models (LLMs) distinguish between three kinds of content: statements that are true, statements that are false, and statements that are neither true nor false, such as claims about fictional characters.

What's the problem?

LLMs are good at answering factual questions, but it's unclear whether they *really* represent truth internally or are just good at pattern matching. The concern is that an LLM's notion of what counts as true or false might shift if the definition of 'truth' is changed slightly, especially for information the model hasn't encountered before. The researchers wanted to measure how stable the models' internal representations remain when the operational definition of 'truth' is perturbed.

What's the solution?

The researchers trained a simple tool (a linear probe) on the internal activations of several LLMs to learn how each model separates true statements from not-true ones. Then they subtly changed what counted as 'true' by relabeling two kinds of 'neither' statements: fact-like assertions about entities that are likely absent from the model's training data (obscure, unfamiliar facts) and claims drawn from clearly fictional contexts (like Harry Potter). They measured how much the probe's decision boundary, and hence its truth judgements, shifted under these relabelings. They tested this on sixteen open-source LLMs across three domains of knowledge.
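The probe-and-perturb idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual pipeline: it uses synthetic Gaussian clusters standing in for LLM activations, fits a logistic-regression probe twice (once with 'neither' statements labeled not-true, once relabeled as true), and reports the fraction of statements whose truth judgement flips between the two probes. All names and cluster parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for LLM activations (d-dimensional vectors).
# True and false statements form separated clusters; "neither"
# statements (e.g. unfamiliar assertions) sit in between.
d = 16
X_true = rng.normal(+1.0, 1.0, size=(200, d))
X_false = rng.normal(-1.0, 1.0, size=(200, d))
X_neither = rng.normal(0.0, 1.0, size=(100, d))
X = np.vstack([X_true, X_false, X_neither])

def fit_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(true)
        g = p - y                               # logistic-loss gradient
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def predict(w, b, X):
    """Hard truth judgement: 1 = true, 0 = not true."""
    return ((X @ w + b) > 0).astype(int)

# Baseline labels: neither-statements counted as not-true (0).
y_base = np.concatenate([np.ones(200), np.zeros(200), np.zeros(100)])
w0, b0 = fit_probe(X, y_base)

# Perturbed labels: the same neither-statements relabeled as true (1).
y_pert = np.concatenate([np.ones(200), np.zeros(200), np.ones(100)])
w1, b1 = fit_probe(X, y_pert)

# Stability metric: fraction of statements whose judgement flips
# when the decision boundary moves under the relabeling.
flipped = (predict(w0, b0, X) != predict(w1, b1, X)).mean()
print(f"flipped truth judgements: {flipped:.1%}")
```

In the paper's framing, a model whose 'neither' statements cluster coherently (the familiar fictional case) would show a small flip rate, while diffusely represented unfamiliar statements would drag the boundary and flip many judgements.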

Why it matters?

This research shows that LLMs' truth representations are less stable for statements they haven't 'seen' before than for statements they recognize as fictional. This suggests that their sense of truth rests more on familiarity than on genuine factual understanding. It matters because it provides a diagnostic for auditing and training LLMs to be more reliable and consistent, even under uncertain or unfamiliar information, and it argues for optimizing coherent internal representations of truth rather than output accuracy alone.

Abstract

Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to 40% flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes (≤8.2%). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.