Why Language Models Hallucinate
Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang
2025-09-08
Summary
This paper explores why large language models, the systems behind AI chatbots, sometimes confidently state things that aren't true, a problem called 'hallucination'. It argues that this isn't a mysterious flaw but a direct result of how these models are trained and tested.
What's the problem?
Large language models are prone to 'hallucinations': they make up plausible-sounding but incorrect information. This happens even with the best models and erodes users' trust. The core issue is that models rarely say 'I don't know'; they guess instead, and current training and evaluation methods actually *reward* that guessing.
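To see why accuracy-only grading rewards guessing, consider a simple expected-score calculation (an illustration consistent with the paper's argument, not taken from it): if a guess is correct with probability p, then

```latex
% Accuracy-only (0/1) grading: +1 for a correct answer, 0 otherwise.
% A guess that is right with probability p has expected score
\mathbb{E}[\text{score}\mid\text{guess}] = p\cdot 1 + (1-p)\cdot 0 = p
\;\ge\; 0 = \mathbb{E}[\text{score}\mid\text{abstain}],
% so whenever p > 0, a score-maximizing model prefers guessing over "I don't know".
```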
What's the solution?
The researchers trace hallucinations to the model's inability to reliably distinguish correct from incorrect statements during pretraining. More importantly, they argue that the way we *evaluate* these models keeps hallucinations alive: because most benchmarks award credit only for answering and give nothing for admitting uncertainty, models learn to prioritize appearing confident over being truthful. The proposed fix isn't to create new hallucination tests, but to change how existing, widely used benchmarks are scored so that honesty and acknowledging uncertainty are rewarded rather than penalized.
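As a rough illustration of that fix (a minimal sketch, not the authors' code; the specific penalty value and function name are assumptions for demonstration), adding an explicit penalty for wrong answers makes abstaining the better strategy whenever the model's confidence falls below a threshold:

```python
# Minimal sketch: expected score of guessing vs. abstaining under two grading schemes.
# Scoring: +1 for a correct answer, -wrong_penalty for an incorrect one, 0 for "I don't know".

def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of guessing when the guess is correct with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    accuracy_only = expected_score(p, wrong_penalty=0.0)   # typical accuracy-only benchmark
    with_penalty = expected_score(p, wrong_penalty=2.0)    # guessing pays off only if p > 2/3
    print(f"confidence {p:.1f}: "
          f"accuracy-only -> {'guess' if accuracy_only > 0 else 'abstain'}, "
          f"penalized -> {'guess' if with_penalty > 0 else 'abstain'}")
```

Under accuracy-only scoring, guessing dominates at every confidence level; with a penalty for wrong answers, answering only pays off when the model is sufficiently sure, which is the kind of behavior the authors argue benchmarks should encourage.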
Why it matters?
Fixing this problem is crucial for building trustworthy AI. If AI systems consistently make things up, people won't rely on them for important tasks. By changing how we train and evaluate these models, we can encourage them to be more honest and reliable, ultimately leading to more useful and dependable AI technology.
Abstract
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.