End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
2025-08-25
Summary
This paper introduces a new system called Deep-DxSearch, which aims to improve the accuracy of medical diagnoses made by large language models (LLMs). It tackles the issues of LLMs sometimes making things up or lacking necessary knowledge when trying to figure out what's wrong with a patient.
What's the problem?
Medical LLMs often struggle with accurate diagnoses because they don't always have all the necessary medical knowledge and can sometimes 'hallucinate,' meaning they generate incorrect or nonsensical information. While connecting these models to external databases helps, it's often done in a way that doesn't fully utilize the available information or clearly show *how* the model arrived at its conclusion. Essentially, it's hard to trust the reasoning process.
What's the solution?
The researchers created Deep-DxSearch, which equips the LLM to actively search for information and then *explain* its reasoning. They trained the LLM using a method called reinforcement learning, where it earns rewards for things like retrieving relevant information, structuring its thoughts logically, and, most importantly, making accurate diagnoses. The system uses a large collection of patient records and medical knowledge as its 'environment' to learn from, and the training process shapes how it uses this information to solve diagnostic problems.
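To make the reward idea concrete, here is a minimal sketch of a composite reward combining the four signal types the paper names (format, retrieval, reasoning structure, diagnostic accuracy). The field names, weights, and scoring rules are illustrative assumptions, not Deep-DxSearch's actual implementation.

```python
def composite_reward(trace, gold_diagnosis, weights=(0.1, 0.2, 0.2, 0.5)):
    """Score one diagnostic rollout with a weighted sum of four rewards.

    `trace` is a hypothetical dict summarizing the agent's rollout:
      - "well_formed":        bool, output followed the required format
      - "retrieved_relevant": float in [0, 1], fraction of retrieved
                              evidence judged relevant to the case
      - "steps_ordered":      bool, reasoning followed the expected
                              observe -> retrieve -> diagnose structure
      - "diagnosis":          str, the agent's final answer
    """
    w_fmt, w_ret, w_struct, w_acc = weights
    r_fmt = 1.0 if trace["well_formed"] else 0.0          # format reward
    r_ret = float(trace["retrieved_relevant"])            # retrieval reward
    r_struct = 1.0 if trace["steps_ordered"] else 0.0     # structure reward
    r_acc = 1.0 if trace["diagnosis"] == gold_diagnosis else 0.0  # accuracy
    return w_fmt * r_fmt + w_ret * r_ret + w_struct * r_struct + w_acc * r_acc
```

In RL training, a scalar like this would be computed for each sampled rollout and used to update the policy, so the model is pushed toward trajectories that both retrieve well and diagnose correctly.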
Why it matters?
This work is important because it significantly improves the accuracy of medical diagnoses made by AI, even surpassing powerful models like GPT-4o. By making the reasoning process more transparent and reliable, Deep-DxSearch could help doctors make better decisions and provide more accurate preliminary diagnoses, especially for both common and rare diseases. It represents a step towards building AI tools that clinicians can confidently use to improve patient care.
Abstract
Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.