Medical large language models are easily distracted
Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, Eric Karl Oermann
2025-04-03

Summary
This paper shows that AI language models used in medicine are easily confused by irrelevant details mixed into a question or a patient's notes, much like background chatter in a doctor's office.
What's the problem?
AI models that could help doctors are easily distracted by information that isn't relevant, leading to wrong answers. This is a problem because real clinical conversations and notes are full of extra information.
What's the solution?
The researchers tested AI models on medical exam questions with distracting information added. They found that the models' accuracy dropped, and common fixes, such as giving the model reference material or extra medical training, didn't help much.
Why does it matter?
This work matters because it shows that AI in medicine needs to be better at focusing on what's important to avoid making mistakes that could harm patients.
Abstract
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, has the potential to introduce additional noise, making it crucial to assess the ability of LLMs to filter out irrelevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context, or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not mitigate this effect and in some cases introduced their own confounders that further degraded performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
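To make the benchmarking setup concrete, below is a minimal sketch of how a MedDistractQA-style evaluation could be run: each multiple-choice item is asked once in its clean form and once with a distracting sentence appended, and accuracy is compared across the two conditions. The `query_llm` function, the example item, and the distractor sentence are illustrative assumptions, not the paper's actual code or data.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for the chat-completion API of the model under test.
    Expected to return the model's answer (a single option letter)."""
    raise NotImplementedError

def build_prompt(question: str, options: dict[str, str], distractor: str | None = None) -> str:
    """Format a multiple-choice item, optionally appending a distracting statement."""
    stem = question if distractor is None else f"{question} {distractor}"
    choices = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return f"{stem}\n{choices}\nAnswer with the single letter of the best option."

# Illustrative USMLE-style item; the distractor uses a clinical word ("stroke")
# in a non-clinical sense, mirroring the polysemous distractions described above.
item = {
    "question": ("A 58-year-old man presents with sudden-onset right-sided weakness "
                 "and slurred speech. What is the most appropriate next step?"),
    "options": {"A": "Non-contrast head CT", "B": "Lumbar puncture",
                "C": "Carotid endarterectomy", "D": "Oral aspirin only"},
    "answer": "A",
    "distractor": "His daughter mentioned that his backstroke has really improved this summer.",
}

def evaluate(items: list[dict]) -> tuple[float, float]:
    """Return (clean_accuracy, distracted_accuracy) over the given items."""
    clean_correct = distracted_correct = 0
    for it in items:
        clean_answer = query_llm(build_prompt(it["question"], it["options"]))
        noisy_answer = query_llm(build_prompt(it["question"], it["options"], it["distractor"]))
        clean_correct += clean_answer.strip().upper().startswith(it["answer"])
        distracted_correct += noisy_answer.strip().upper().startswith(it["answer"])
    n = len(items)
    return clean_correct / n, distracted_correct / n
```

The accuracy gap between the two conditions is the quantity the paper reports (drops of up to 17.9%); the same loop could be rerun with a RAG pipeline or a fine-tuned model plugged in behind `query_llm` to test the mitigation strategies discussed in the abstract.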