EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
2025-10-31
Summary
This paper focuses on improving how well computers can understand and use the information found in Electronic Health Records, or EHRs, to help doctors make better decisions.
What's the problem?
While large language models, like the ones powering chatbots, are getting good at many things, they struggle with the specific complexities of EHRs. They often can't handle the wide range of tasks doctors need help with, and they lack the ability to reason through medical information like a clinician would. Essentially, current AI isn't 'thinking' through patient charts effectively.
What's the solution?
The researchers created a massive dataset called EHR-Ins, filled with 300,000 examples of medical reasoning and 4 million examples of general EHR data, covering 42 different medical tasks. They used a new method, based on 'thinking graphs,' to automatically generate this high-quality data at scale. Then, they built a new series of language models, called EHR-R1, specifically trained on this data. These models were trained in stages: first learning general medical knowledge (domain adaptation), then focusing on reasoning skills, and finally refining their behavior through reinforcement learning, a trial-and-error-style process guided by reward feedback. They also created a new testing benchmark, EHR-Bench, to evaluate how well these models perform on real-world EHR tasks.
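The staged recipe above can be sketched as a simple sequential pipeline. Everything below is an illustrative stub, with hypothetical function and field names that mirror the stages described in the text; it is not the authors' released training code.

```python
from typing import Callable, Dict, List

# Each "stage" takes a toy model record and returns an updated one.
# The stage names follow the paper's description; the bodies are stubs.

def domain_adaptation(model: Dict) -> Dict:
    # Stage 1: train on the ~4M non-reasoning EHR cases to absorb
    # general medical and EHR knowledge.
    return {**model, "stages": model["stages"] + ["domain_adaptation"]}

def reasoning_enhancement(model: Dict) -> Dict:
    # Stage 2: fine-tune on the 300k thinking-graph-derived reasoning
    # cases spanning the 42 EHR task types.
    return {**model, "stages": model["stages"] + ["reasoning_enhancement"]}

def reinforcement_learning(model: Dict) -> Dict:
    # Stage 3: refine behavior with reward feedback (trial and error).
    return {**model, "stages": model["stages"] + ["reinforcement_learning"]}

def train_ehr_r1(base: Dict, stages: List[Callable[[Dict], Dict]]) -> Dict:
    # Apply the stages in order, as in the multi-stage paradigm.
    for stage in stages:
        base = stage(base)
    return base

model = train_ehr_r1(
    {"name": "EHR-R1", "stages": []},
    [domain_adaptation, reasoning_enhancement, reinforcement_learning],
)
print(model["stages"])
# → ['domain_adaptation', 'reasoning_enhancement', 'reinforcement_learning']
```

The point of the sketch is only the ordering: each stage builds on the capabilities instilled by the previous one, which is why reasoning data is introduced after general domain knowledge and before reinforcement learning.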
Why it matters?
This work is important because it significantly improves the accuracy and reliability of AI systems used for analyzing EHRs. The new models, EHR-R1, outperform existing models, including powerful ones like GPT-4o, at understanding EHRs and predicting medical outcomes. This means better tools for doctors, potentially leading to more accurate diagnoses, more effective treatments, and ultimately better patient care.
Abstract
Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances in large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and a lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on EHR-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench significantly advance the development of more reliable and clinically relevant EHR analysis.