
Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman

2024-06-19


Summary

This paper discusses how language models, which are AI systems that understand and generate human language, struggle to accurately handle drug names in medical contexts. The authors created a new dataset called RABBITS to test how well these models perform when brand and generic drug names are swapped.

What's the problem?

Medical knowledge is context-dependent, and drug names are a prime example: patients frequently use brand names (like Advil or Tylenol) instead of their generic counterparts (ibuprofen or acetaminophen). A brand name and its generic refer to the same drug, so an AI system should answer identically for either. The problem is that many language models do not handle these equivalent names consistently, which can lead to incorrect medical information being given to patients.

What's the solution?

To address this issue, the authors developed the RABBITS dataset, which evaluates how well language models perform on medical questions after brand names are swapped with their generic equivalents (and vice versa), using drug pairs verified by physician annotators. They tested both open-source and API-based language models on established medical benchmarks (MedQA and MedMCQA) and found that accuracy dropped by 1-10% when drug names were swapped; a minimal sketch of this swap-and-compare setup is shown below. The study also identified a possible source of this fragility: contamination of test data in widely used pre-training datasets, meaning some benchmark questions appear in the data the models were trained on, so models may be recalling the original wording rather than reasoning about the drugs themselves.
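The swap-and-compare idea is simple enough to sketch in code. The snippet below is only an illustration of the general approach, not the RABBITS implementation: BRAND_TO_GENERIC, ask_model, and the question format are hypothetical stand-ins, whereas the real dataset uses physician-vetted drug pairs applied to MedQA and MedMCQA.

```python
import re

# Illustrative brand -> generic mapping (hypothetical stand-in; RABBITS uses
# a much larger, physician-verified list of drug name pairs).
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
    "Lipitor": "atorvastatin",
}

def swap_drug_names(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word brand names in a question with their generic equivalents."""
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text, flags=re.IGNORECASE)
    return text

def accuracy(ask_model, questions: list[dict]) -> float:
    """Fraction of multiple-choice questions the model answers correctly."""
    correct = sum(
        ask_model(q["question"], q["options"]) == q["answer"]
        for q in questions
    )
    return correct / len(questions)

def evaluate_robustness(ask_model, questions: list[dict]) -> tuple[float, float]:
    """Compare accuracy on original questions vs. drug-name-swapped questions."""
    original_acc = accuracy(ask_model, questions)
    swapped = [
        {**q, "question": swap_drug_names(q["question"], BRAND_TO_GENERIC)}
        for q in questions
    ]
    swapped_acc = accuracy(ask_model, swapped)
    return original_acc, swapped_acc  # the paper reports drops of 1-10%
```

Here `ask_model` stands in for any call to an LLM that returns its chosen answer option; the gap between the two accuracies is the robustness measure the paper reports.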

Why it matters?

This research is important because it highlights a critical weakness in AI systems used in healthcare. If language models cannot recognize that a brand name and its generic refer to the same drug, they risk providing incorrect medical advice, which could have serious consequences for patient safety. By understanding these limitations and improving how models handle drug name variations, we can work towards making AI tools more reliable and effective in medical settings.

Abstract

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1-10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.