Truth Neurons
Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu
2025-05-21
Summary
This paper identifies 'truth neurons,' specific neurons inside AI language models that help the model give truthful answers regardless of the topic.
What's the problem?
AI models sometimes give answers that aren't true, and it is hard to understand or control how they represent what is true and what is false.
What's the solution?
The researchers identified specific neurons in the model that are responsible for truthful answers, and showed that suppressing these neurons makes the model perform worse on many different benchmarks.
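The suppression experiment described above can be illustrated at toy scale. The sketch below is a minimal illustration under stated assumptions, not the authors' method: the three-neuron classifier, its weights, the two datasets, and the helper names (`attribution_scores`, `top_k_neurons`, `accuracy`) are all hypothetical, and neurons are ranked with a simple activation-times-gradient score as a stand-in for whatever attribution procedure the paper actually uses. Neurons are selected using one toy dataset only, then zeroed out, and accuracy is compared on both datasets.

```python
from typing import AbstractSet, List, Sequence, Set, Tuple

# Hypothetical toy classifier: 2 input features -> 3 ReLU hidden neurons -> 1 logit.
# A positive logit means the model judges the statement "true".
# All weights and data below are invented for illustration.
W_HIDDEN: List[Tuple[float, float]] = [
    (1.0, -1.0),  # neuron 0: fires when feature 0 exceeds feature 1
    (-1.0, 1.0),  # neuron 1: fires when feature 1 exceeds feature 0
    (1.0, 1.0),   # neuron 2: tracks overall magnitude, barely used by the readout
]
W_OUT: List[float] = [2.0, -2.0, 0.1]

Example = Tuple[Tuple[float, float], bool]


def hidden(x: Tuple[float, float], ablate: AbstractSet[int] = frozenset()) -> List[float]:
    """ReLU hidden activations; neurons listed in `ablate` are forced to zero."""
    acts = []
    for i, (w0, w1) in enumerate(W_HIDDEN):
        a = max(0.0, w0 * x[0] + w1 * x[1])
        acts.append(0.0 if i in ablate else a)
    return acts


def logit(x: Tuple[float, float], ablate: AbstractSet[int] = frozenset()) -> float:
    """Linear readout over the (possibly ablated) hidden activations."""
    return sum(w * a for w, a in zip(W_OUT, hidden(x, ablate)))


def accuracy(data: Sequence[Example], ablate: AbstractSet[int] = frozenset()) -> float:
    """Fraction of examples where (logit > 0) matches the true/false label."""
    correct = sum((logit(x, ablate) > 0) == label for x, label in data)
    return correct / len(data)


def attribution_scores(data: Sequence[Example]) -> List[float]:
    """Mean |activation * d(logit)/d(activation)| per hidden neuron.

    With a linear readout, the gradient w.r.t. neuron i is simply W_OUT[i].
    """
    totals = [0.0] * len(W_HIDDEN)
    for x, _ in data:
        for i, a in enumerate(hidden(x)):
            totals[i] += abs(a * W_OUT[i])
    return [t / len(data) for t in totals]


def top_k_neurons(data: Sequence[Example], k: int) -> Set[int]:
    """Indices of the k neurons with the highest attribution scores."""
    scores = attribution_scores(data)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


# Two toy "benchmarks" over different input ranges (stand-ins for different subjects).
DATASET_A: List[Example] = [((0.9, 0.1), True), ((0.2, 0.8), False),
                            ((0.7, 0.3), True), ((0.1, 0.6), False)]
DATASET_B: List[Example] = [((3.0, 1.0), True), ((1.0, 4.0), False),
                            ((5.0, 2.0), True), ((2.0, 2.5), False)]

if __name__ == "__main__":
    selected = top_k_neurons(DATASET_A, k=2)  # identified using dataset A only
    print("selected neurons:", sorted(selected))
    for name, data in [("A", DATASET_A), ("B", DATASET_B)]:
        print(name, "full:", accuracy(data),
              "ablate selected:", accuracy(data, selected),
              "ablate neuron 2:", accuracy(data, {2}))
```

In this toy setup, neurons 0 and 1 are selected; zeroing them drops accuracy on both datasets from 1.0 to 0.5, while zeroing the low-attribution neuron 2 leaves accuracy at 1.0. The control ablation matters: it shows the drop comes from removing the specific neurons that were selected, not from removing neurons in general.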
Why does it matter?
Understanding and controlling these truth neurons can help us build AI that is more reliable and honest, which is really important when people depend on AI for accurate information.
Abstract
Truth neurons encode truthfulness in language models at a subject-agnostic level, and suppressing these neurons degrades performance across benchmarks.