MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding

2025-02-24

Summary

This paper introduces MedHallu, a new benchmark built to test how well AI language models can spot made-up or incorrect information in medical questions and answers.

What's the problem?

As AI is being used more in healthcare to answer medical questions, there's a big risk that these AI systems might give answers that sound right but are actually wrong. This is called 'hallucination,' and it could be really dangerous for patients if doctors rely on this incorrect information.

What's the solution?

The researchers built MedHallu, a large test set of 10,000 medical question-answer pairs drawn from real PubMedQA questions. Some of the answers are correct, and some are deliberately made up. They tested a range of AI models, including some of the best available, to see how well each one could spot the fake answers. They also tried different ways to improve performance on this task, such as giving the model extra medical knowledge and letting it say 'not sure' when it isn't confident.
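To make the setup above concrete, here is a minimal sketch of what one benchmark item and a detector wrapper might look like. The field names, example text, and keyword-matching logic are all illustrative assumptions, not the dataset's actual schema or the paper's method; they only show the shape of the task: each answer carries a correct/hallucinated label, and the model's judgment can be mapped to one of three choices, including the 'not sure' abstention the paper experiments with.

```python
# Hypothetical benchmark item -- field names are illustrative,
# not MedHallu's actual schema.
item = {
    "question": "Does drug X reduce blood pressure in adults?",
    "answer": "Yes, trials show a consistent reduction ...",
    "label": "hallucinated",   # or "faithful"
    "difficulty": "hard",      # the paper groups hallucinations by difficulty
}

def detect(model_output: str) -> str:
    """Map a model's free-text judgment to one of three choices.

    This naive keyword matcher is a stand-in for however one actually
    parses model output; it exists only to show the three-way decision,
    including the 'not sure' abstention option.
    """
    text = model_output.lower()
    if "not sure" in text:
        return "not sure"
    if "hallucinat" in text or "incorrect" in text:
        return "hallucinated"
    return "faithful"
```

In practice the parsing step matters: a model allowed to abstain needs its 'not sure' responses recognized reliably, or the abstention option cannot help precision.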

Why it matters?

This matters because as we use AI more in healthcare, we need to make sure it's safe and reliable. MedHallu helps us see where AI might make mistakes in medical information, which could prevent wrong diagnoses or treatments. It also shows us how to make AI better at this important task, which could make healthcare safer and more accurate in the future.

Abstract

Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.
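The abstract reports F1 scores for binary hallucination detection, with a 'not sure' option as a third answer category. A small sketch of one plausible scoring scheme is below. The treatment of 'not sure' here, scored as a non-detection that can never be a false positive but misses true hallucinations, is an assumption for illustration; the paper's exact scoring rule may differ.

```python
def f1_hallucination(preds, golds):
    """Precision, recall, and F1 for the 'hallucinated' class.

    preds: model outputs in {"hallucinated", "faithful", "not sure"}.
    golds: ground-truth labels in {"hallucinated", "faithful"}.

    Assumed convention: "not sure" never counts as a false positive,
    but it does count as a miss (false negative) when the gold label
    is "hallucinated" -- abstaining trades recall for precision.
    """
    pairs = list(zip(preds, golds))
    tp = sum(p == "hallucinated" and g == "hallucinated" for p, g in pairs)
    fp = sum(p == "hallucinated" and g == "faithful" for p, g in pairs)
    fn = sum(p != "hallucinated" and g == "hallucinated" for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this convention, a model that abstains on cases it would otherwise get wrong gains precision without adding false positives, which is one way the 'not sure' category could lift precision and F1 as the abstract describes.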