Evolving Diagnostic Agents in a Virtual Clinical Environment
Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
2025-10-30
Summary
This paper introduces a new way to train large language models (LLMs) to act like doctors, specifically to diagnose illnesses. Instead of just giving the LLM a bunch of medical cases to read, the researchers created a system where the LLM learns by *doing* – by asking questions, ordering tests, and getting feedback on whether its decisions are leading to the right diagnosis.
What's the problem?
Current LLMs that try to diagnose medical conditions are often trained by simply being shown summaries of cases. This is like learning to play basketball only by reading a book about it – you don't get the experience of actually playing the game. These models struggle with the dynamic, back-and-forth nature of real-world diagnosis, where doctors need to adapt their approach based on test results and patient responses. They don't learn *how* to diagnose, just *what* diagnoses often look like in specific situations.
What's the solution?
The researchers built a simulated medical environment called DiagGym, trained on real patient records, that realistically generates the result of each test or question an agent orders. They then used reinforcement learning to train an LLM, named DiagAgent, to navigate this environment: DiagAgent learns by trial and error, receiving rewards for decisions that lead to accurate diagnoses and penalties for mistakes. They also created a new benchmark of medical cases, DiagBench, to measure how well DiagAgent performs. Essentially, they built a virtual doctor's office where the AI can practice and improve.
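To make the loop concrete, here is a minimal toy sketch of the interaction pattern described above: an agent repeatedly orders examinations from an environment and finally commits to a diagnosis, receiving an outcome-based reward. Everything here (class names, the scripted policy, the canned case) is hypothetical illustration; the real DiagGym is a learned world model that generates examination results conditioned on patient history, not a lookup table.

```python
class ToyDiagGym:
    """Toy stand-in for DiagGym: returns canned exam results for one case.

    The real system conditions a trained world model on the patient's
    history; this sketch just looks results up in a dict.
    """
    def __init__(self, case):
        self.case = case  # dict with 'results' and 'diagnosis'

    def run_exam(self, exam):
        return self.case["results"].get(exam, "unremarkable")

    def check_diagnosis(self, diagnosis):
        # Outcome-based feedback: reward a correct final diagnosis.
        return 1.0 if diagnosis == self.case["diagnosis"] else -1.0


def rollout(env, policy, max_turns=5):
    """One multi-turn episode: the policy either orders an exam or commits."""
    history = []
    for _ in range(max_turns):
        action = policy(history)
        if action["type"] == "diagnose":
            return history, env.check_diagnosis(action["value"])
        result = env.run_exam(action["value"])
        history.append((action["value"], result))
    return history, -1.0  # penalize never committing to a diagnosis


# Demo with a hand-scripted policy (an RL-trained agent would replace this).
case = {"results": {"troponin": "elevated"},
        "diagnosis": "myocardial infarction"}

def scripted_policy(history):
    if not history:
        return {"type": "exam", "value": "troponin"}
    if history[-1][1] == "elevated":
        return {"type": "diagnose", "value": "myocardial infarction"}
    return {"type": "diagnose", "value": "unknown"}

history, reward = rollout(ToyDiagGym(case), scripted_policy)
print(history, reward)  # reward is 1.0 when the diagnosis matches
```

In the paper's setup, the scripted policy is replaced by an LLM whose parameters are updated with multi-turn reinforcement learning so that episode rewards like the one returned here shape its examination-ordering and diagnosis strategy.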
Why it matters?
This research shows that LLMs can become much better at diagnosis if they're allowed to learn through interaction and feedback, rather than just passively reading information. DiagAgent significantly outperformed other leading LLMs, like GPT-4o, in both accuracy and its ability to choose the right tests. This is a big step towards creating AI tools that can genuinely assist doctors and improve patient care, because it demonstrates the value of letting AI learn to *think* like a doctor, not just *memorize* medical facts.
Abstract
In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and the recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnostic process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.