How to Steer LLM Latents for Hallucination Detection?

Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li

2025-03-07

Summary

This paper introduces the Truthfulness Separator Vector (TSV), a new method that helps detect when AI language models are making up false information, also known as hallucinations.

What's the problem?

AI language models sometimes generate false or misleading information, which is a serious obstacle to deploying them safely in real-world situations. Current detection methods look at the AI's internal representation of language, but that representation is optimized for fluent text, not factual accuracy, so it doesn't clearly separate true statements from made-up ones.

What's the solution?

The researchers created TSV, a lightweight vector that adjusts how the AI model represents information internally, making it easier to tell true statements apart from hallucinations. TSV works in two stages: first, it learns from a small set of labeled examples; then it uses that knowledge to sort through a larger set of unlabeled AI-generated text, keeping only the assignments it is confident about. This approach doesn't change the AI model's parameters, making it easy to use with existing systems.
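The core idea can be sketched in a few lines. This is a minimal, illustrative mock-up, not the paper's implementation: the latent states are random placeholders, the steering vector here is random rather than learned, and the scale factor `0.5` is a made-up constant. It only shows the mechanics of injecting a vector into hidden states and scoring a generation by its distance to labeled-exemplar centroids.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy latent states of LLM answers. In the paper these come from a
# hidden layer of the model; here they are random placeholders.
truthful = rng.normal(loc=+1.0, size=(5, dim))
hallucinated = rng.normal(loc=-1.0, size=(5, dim))

# The steering vector is learned in the paper; a random unit vector
# stands in for it here. It is added to hidden states at inference
# time, without touching the model's weights.
tsv = rng.normal(size=dim)
tsv /= np.linalg.norm(tsv)

def steer(h):
    # Inject the steering vector into a latent state
    # (the 0.5 scale is a hypothetical hyperparameter).
    return h + 0.5 * tsv

# Centroids of the steered labeled exemplars.
t_centroid = steer(truthful).mean(axis=0)
h_centroid = steer(hallucinated).mean(axis=0)

def truthfulness_score(h):
    # Positive score -> the steered latent sits closer to the
    # truthful cluster than to the hallucinated one.
    h = steer(h)
    return np.linalg.norm(h - h_centroid) - np.linalg.norm(h - t_centroid)
```

Because the vector is only added at inference time, the same frozen model can be used with or without it, which is what makes the method easy to bolt onto existing systems.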

Why does it matter?

This matters because it makes AI language models more trustworthy and safer to use in important real-world applications. TSV detects hallucinations more accurately than previous methods, even with very little labeled data, and it generalizes well across different types of text. This could help prevent the spread of AI-generated misinformation and make AI assistants more reliable in fields like healthcare, education, and business.

Abstract

Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
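The second stage of the framework pseudo-labels unlabeled generations with optimal transport and then filters by confidence. The toy sketch below uses the standard Sinkhorn algorithm for entropic optimal transport on made-up distances; the cost values, the balanced class prior, and the `0.9` confidence threshold are all illustrative assumptions, not figures from the paper.

```python
import numpy as np

def sinkhorn_pseudo_labels(cost, row_marginal, col_marginal,
                           eps=0.1, iters=200):
    """Entropic optimal transport (Sinkhorn iterations): assign
    unlabeled samples to classes under a prior on class proportions."""
    K = np.exp(-cost / eps)
    u = np.ones(cost.shape[0])
    v = np.ones(cost.shape[1])
    for _ in range(iters):
        u = row_marginal / (K @ v)
        v = col_marginal / (K.T @ u)
    plan = np.diag(u) @ K @ np.diag(v)
    # Normalize each row into a soft label over the two classes.
    return plan / plan.sum(axis=1, keepdims=True)

# cost[i, j]: distance from unlabeled latent i to the class-j centroid
# (toy numbers; column 0 = truthful, column 1 = hallucinated).
cost = np.array([[0.2, 1.5],
                 [1.4, 0.3],
                 [0.9, 0.8]])
rows = np.full(3, 1 / 3)        # each sample carries equal mass
cols = np.array([0.5, 0.5])     # assumed balanced class prior
soft = sinkhorn_pseudo_labels(cost, rows, cols)

# Confidence-based filtering: keep only high-confidence pseudo-labels
# for augmenting the exemplar set (0.9 is a hypothetical threshold).
confident = soft.max(axis=1) > 0.9
```

The marginal constraints are what distinguish this from plain nearest-centroid labeling: they stop the pseudo-labeler from collapsing onto one class when the unlabeled pool is imbalanced, and the confidence filter discards ambiguous samples like the third row above.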