LettuceDetect: A Hallucination Detection Framework for RAG Applications

Ádám Kovács, Gábor Recski

2025-03-03

Summary

This paper introduces LettuceDetect, a new system for spotting when AI language models make up false information in their answers, even when they are drawing on external information sources.

What's the problem?

Even when AI systems use external information to answer questions, they can still sometimes give wrong or made-up answers. Current methods to catch these mistakes are either limited in how much text they can handle or require a lot of computing power, making them impractical for real-world use.

What's the solution?

The researchers created LettuceDetect, which is built on ModernBERT, a transformer encoder that can handle longer pieces of text (up to 8,000 tokens). They trained it on RAGTruth, a benchmark dataset of AI answers labeled as supported or hallucinated. LettuceDetect looks at the question, the information given to the AI, and the AI's answer, and flags which parts of the answer might be made up. It's much smaller than the best competing systems but performs better than previous encoder-based ones, and it's fast enough to use in real applications.
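Because the detector labels individual tokens, its output can be aggregated both into highlighted spans and into a per-answer verdict. Here is a minimal sketch of that aggregation step (illustrative Python; the function names and label convention are our assumptions, not LettuceDetect's actual code):

```python
def merge_hallucinated_spans(tokens, labels):
    """Group consecutive tokens flagged as unsupported (label 1) into spans."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a span that runs to the end of the answer
        spans.append(" ".join(current))
    return spans

def example_is_hallucinated(labels):
    """Example-level verdict: the answer is flagged if any token is unsupported."""
    return any(lab == 1 for lab in labels)
```

This mirrors how a token-level classifier can serve both the span-highlighting and the example-level detection use cases described in the paper.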

Why it matters?

This matters because as AI becomes more common in our daily lives, we need ways to make sure it's giving us accurate information. LettuceDetect could help make AI systems more trustworthy and useful by quickly catching mistakes, without needing super powerful computers. This could lead to safer and more reliable AI assistants in various fields like education, customer service, or research.

Abstract

Retrieval Augmented Generation (RAG) systems remain vulnerable to hallucinated answers despite incorporating external knowledge sources. We present LettuceDetect, a framework that addresses two critical limitations in existing hallucination detection methods: (1) the context window constraints of traditional encoder-based methods, and (2) the computational inefficiency of LLM-based approaches. Building on ModernBERT's extended context capabilities (up to 8k tokens) and trained on the RAGTruth benchmark dataset, our approach outperforms all previous encoder-based models and most prompt-based models, while being approximately 30 times smaller than the best models. LettuceDetect is a token-classification model that processes context-question-answer triples, allowing for the identification of unsupported claims at the token level. Evaluations on the RAGTruth corpus demonstrate an F1 score of 79.22% for example-level detection, which is a 14.8% improvement over Luna, the previous state-of-the-art encoder-based architecture. Additionally, the system can process 30 to 60 examples per second on a single GPU, making it more practical for real-world RAG applications.
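The token-classification setup described in the abstract implies packing each context-question-answer triple into a single encoder input of at most 8k tokens. The sketch below illustrates one way such packing could work; it uses whitespace splitting as a stand-in for ModernBERT's real subword tokenizer, and the `[SEP]` layout is a hypothetical choice, not the paper's documented format:

```python
def build_triple_input(context, question, answer, sep="[SEP]", max_tokens=8192):
    """Concatenate a context-question-answer triple into one sequence,
    truncating the context so the whole input fits the encoder window.
    Whitespace tokens stand in for real subword tokens here."""
    fixed = f"{sep} {question} {sep} {answer}".split()
    budget = max_tokens - len(fixed)        # tokens left for the context
    ctx_tokens = context.split()[:budget]   # truncate context, not the answer
    return " ".join(ctx_tokens + fixed)
```

Truncating the context rather than the answer keeps the tokens being classified intact, which matters for a model whose labels attach to answer tokens.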