MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
2026-02-16
Summary
This paper introduces MedXIAOHE, a new artificial intelligence model designed to understand and reason about medical information by combining images and text. It aims to be a practical tool for real-world medical applications and outperforms existing models across a wide range of medical benchmarks.
What's the problem?
Current AI models struggle with the complexity and variety of medical knowledge. They often lack understanding of less common diseases and have trouble with complex reasoning needed for diagnosis. Existing models also aren't always reliable, sometimes making things up or not following instructions well when generating reports.
What's the solution?
The researchers created MedXIAOHE using a multi-stage training process. First, they organized a huge amount of medical data, making sure to include information about rarer conditions. Then, they taught the model to think like a medical expert using reinforcement learning and by letting it use tools to help with diagnosis. Finally, they focused on making the model more trustworthy by having it explain its reasoning, base its answers on evidence, and avoid making up information when writing reports.
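The report does not detail how the data-organization step is implemented. As a purely illustrative sketch (all names and fields below are hypothetical, not the paper's method), one way to organize a corpus around medical entities and upsample long-tail conditions during continual pretraining could look like this:

```python
from collections import Counter
import random

def entity_aware_sample_weights(docs, alpha=0.5):
    """Illustrative sketch, not the paper's method: weight each document by
    the inverse frequency of its rarest tagged medical entity, so documents
    about long-tail conditions (e.g., rare diseases) are sampled more often."""
    entity_counts = Counter(e for doc in docs for e in doc["entities"])
    weights = []
    for doc in docs:
        # Untagged documents fall back to the most common count (lowest weight).
        rarest = min((entity_counts[e] for e in doc["entities"]),
                     default=max(entity_counts.values(), default=1))
        weights.append((1.0 / rarest) ** alpha)
    return weights

# Hypothetical corpus: each document carries the medical entities it mentions.
docs = [
    {"text": "Chest X-ray consistent with pneumonia ...", "entities": ["pneumonia"]},
    {"text": "Case report of Erdheim-Chester disease ...", "entities": ["Erdheim-Chester disease"]},
    {"text": "Routine pneumonia follow-up imaging ...", "entities": ["pneumonia"]},
]
weights = entity_aware_sample_weights(docs)
batch = random.choices(docs, weights=weights, k=2)  # the rare-disease case is favored
```

The exponent `alpha` trades off between frequency-proportional sampling (alpha = 0) and roughly equal treatment of rare and common entities (alpha = 1); both the weighting scheme and the entity tags are assumptions made only for illustration.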
Why it matters?
This research is important because it creates a more capable and reliable AI assistant for doctors and medical professionals. A tool like MedXIAOHE could help improve diagnoses, speed up treatment, and ultimately lead to better patient care. The researchers are also sharing details about how they built the model to encourage further advancements in the field.
Abstract
We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
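The report describes reinforcement learning with verifiable decision traces but does not specify the reward design. As a minimal sketch, assuming rule-based verifiable rewards (a well-formed reasoning trace plus an exact-match check of the final answer), such a reward could look like the following; the tag names and scoring are hypothetical:

```python
import re

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Minimal sketch (assumed, not taken from the paper): score a response for
    RL with a verifiable outcome. The response is expected to contain a
    reasoning trace in <think>...</think> followed by a final answer in
    <answer>...</answer>, which is checked against the reference label."""
    # Format check: the decision trace and answer must both be present.
    if not re.search(r"<think>.+?</think>\s*<answer>.+?</answer>", response, flags=re.DOTALL):
        return 0.0
    answer = re.search(r"<answer>(.+?)</answer>", response, flags=re.DOTALL).group(1)
    # Outcome check: exact match, ignoring case and surrounding whitespace.
    correct = answer.strip().lower() == reference_answer.strip().lower()
    return 1.0 if correct else 0.1  # small credit for a well-formed trace

# Hypothetical single-answer diagnostic question.
resp = ("<think>Fever, productive cough, and a lobar consolidation on the "
        "chest X-ray point to pneumonia.</think><answer>Pneumonia</answer>")
print(verifiable_reward(resp, "pneumonia"))  # -> 1.0
```

In practice a system like this would likely combine several such checks (answer correctness, trace format, tool-call validity) rather than a single rule; the split shown here is only for illustration.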