Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann

2025-11-21

Summary

This paper investigates whether large language models, which excel at understanding and generating text, can be used effectively to improve hospital operations, such as predicting how long patients will stay or how likely they are to be readmitted.

What's the problem?

Hospitals constantly make operational decisions, about patient flow, staffing, and billing, that directly affect patient care and costs. While general-purpose language models are powerful, they often lack the specialized clinical knowledge needed to support these decisions. When tested on real hospital data, existing models performed poorly, especially on predictions like readmission rates and mortality.

What's the solution?

The researchers created a new family of language models called Lang1, pretrained on a massive corpus that blends clinical text from electronic health records at NYU Langone Health with general internet text. They then fine-tuned and evaluated these models on five important hospital tasks: predicting 30-day readmission, predicting 30-day mortality, estimating length of stay, coding comorbidities (a patient's co-occurring conditions), and forecasting insurance claim denials. They found that even a relatively small Lang1 model (1 billion parameters) could outperform much larger general-purpose models once it was fine-tuned on hospital data.
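To make the fine-tuning step concrete, here is a minimal sketch of how a pretrained language model could be fine-tuned for one of these tasks (30-day readmission) framed as binary classification over clinical notes. The checkpoint name `lang1-1b` and the toy notes/labels are placeholders, not the paper's released weights or actual code; this only illustrates the general recipe of in-domain pretraining followed by supervised fine-tuning.

```python
# Minimal sketch: fine-tuning a pretrained LM as a binary classifier for
# 30-day readmission. "lang1-1b" is a HYPOTHETICAL checkpoint name and the
# notes/labels below are toy stand-ins, not real EHR data.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "lang1-1b"  # hypothetical; substitute any available checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Causal LMs often ship without a pad token; reuse EOS so batching works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

# Toy stand-in for de-identified discharge notes with readmission labels.
train_data = Dataset.from_dict({
    "text": [
        "Discharge summary: 68F admitted for CHF exacerbation ...",
        "Discharge summary: 45M post elective knee arthroplasty ...",
    ],
    "label": [1, 0],  # 1 = readmitted within 30 days
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lang1-readmission",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=train_data,
    tokenizer=tokenizer,  # enables dynamic padding of each batch
)
trainer.train()
```

The same framing extends to the other tasks: mortality and claim denial are also binary labels, while length of stay can be binned into classes or treated as regression.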

Why it matters?

This research shows that specialized language models trained on medical data can be surprisingly effective at solving real-world problems in healthcare. It suggests that simply using large, general-purpose models isn't enough: pretraining models on domain-specific medical data and then fine-tuning them for particular tasks is a more promising approach for improving hospital operations and, ultimately, patient care.

Abstract

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
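The abstract reports everything in AUROC: the probability that the model scores a randomly chosen positive case (say, a patient who was readmitted) above a randomly chosen negative one, so 0.5 is chance and 1.0 is perfect ranking. A minimal sketch with made-up labels and scores shows how that number is computed:

```python
# Minimal sketch of the AUROC metric quoted above, on made-up data.
# 0.5 means chance-level ranking; the zero-shot range reported in the
# abstract (36.6%-71.7%) spans from below chance to moderately above it.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                           # 1 = readmitted within 30 days
y_score = [0.10, 0.35, 0.62, 0.60, 0.81, 0.55, 0.22, 0.74]  # model probabilities

print(f"AUROC: {roc_auc_score(y_true, y_score):.3f}")  # 0.938 on this toy data
```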