Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs, Eric Karl Oermann

2025-12-02

Summary

This paper investigates whether AI systems specifically designed for medical use are actually better than more general AI models, like those powering chatbots. It challenges the idea that specialized medical AI is automatically safer or more reliable.

What's the problem?

AI tools are increasingly used in healthcare to help with things like diagnosis and treatment recommendations, but unlike the large general-purpose models everyone is talking about, these clinical tools are rarely tested rigorously or independently. That means we don't really *know* whether the medical AI is actually good at its job, even as it informs important decisions about patient care.

What's the solution?

Researchers compared two widely deployed clinical AI systems, OpenEvidence and UpToDate Expert AI, against three leading general AI models: GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5. They built a 1,000-item benchmark from two existing test sets, MedQA (medical knowledge questions) and HealthBench (realistic clinical scenarios graded against what clinicians would expect), and ran every system on the same items. Scoring focused on completeness of answers, communication quality, awareness of the question's context, and safety considerations.
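The paper doesn't publish its evaluation harness, but a toy version of this kind of head-to-head comparison might look like the sketch below. Everything in it is an illustrative assumption: the stub entries in SYSTEMS stand in for real API calls to the five tools, and the keyword rubric stands in for the clinician-style grading the study actually used.

```python
# Minimal sketch of a multi-system benchmark comparison -- NOT the authors'
# actual harness. System stubs and the toy keyword rubric are assumptions.
from statistics import mean

# Map each system name to a callable that answers a question. In the real
# study these would call OpenEvidence, UpToDate Expert AI, GPT-5,
# Gemini 3 Pro, and Claude Sonnet 4.5; here they return canned answers.
SYSTEMS = {
    "generalist-llm": lambda q: "Start empiric antibiotics, reassess within 48 hours, and document allergies first.",
    "clinical-tool": lambda q: "Start antibiotics.",
}

# Toy rubric: required points per scoring axis (the paper grades axes such as
# completeness, communication quality, context awareness, and safety).
RUBRIC = {
    "completeness": ["antibiotics", "reassess"],
    "safety": ["allergies"],
}

def score(answer: str, rubric: dict[str, list[str]]) -> dict[str, float]:
    """Fraction of required rubric points mentioned in the answer, per axis."""
    low = answer.lower()
    return {axis: mean(1.0 if kw in low else 0.0 for kw in kws)
            for axis, kws in rubric.items()}

def evaluate(questions: list[str]) -> None:
    """Run every system on every question and print mean per-axis scores."""
    for name, ask in SYSTEMS.items():
        per_item = [score(ask(q), RUBRIC) for q in questions]
        summary = {axis: round(mean(s[axis] for s in per_item), 2)
                   for axis in RUBRIC}
        print(name, summary)

if __name__ == "__main__":
    evaluate(["Febrile patient with suspected sepsis: what is the next step?"])
```

The real study's grading of HealthBench responses is far more involved than keyword matching; the sketch only shows the shape of the comparison: the same questions go to every system, and each answer gets per-axis scores that can then be averaged and compared.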

Why does it matter?

The study found that the general AI models consistently performed *better* than the specialized medical AI tools. This is a big deal because it suggests that tools marketed to doctors and hospitals might not be as advanced or reliable as people think. It highlights the need for independent testing and transparency before these AI systems are widely used in healthcare to ensure they actually improve patient outcomes and don't cause harm.

Abstract

Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.