
Baichuan-M2: Scaling Medical Capability with Large Verifier System

Baichuan-M2 Team, Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu

2025-09-03


Summary

This paper focuses on improving how well large language models, or LLMs, can be used in healthcare by making them better at handling real-life medical situations, not just passing traditional medical exams.

What's the problem?

Current medical LLMs do really well on tests like the USMLE, which are basically multiple-choice questions, but they struggle with actual doctor-patient conversations and with making decisions in a dynamic clinical setting. This is because those tests don't mimic the back-and-forth, interactive nature of a real medical consultation, where you need to ask questions and adjust your thinking based on the answers you get.

What's the solution?

The researchers built a system that simulates realistic patient interactions using real, but de-identified, medical records. This 'Patient Simulator' works together with a 'Clinical Rubrics Generator', which produces detailed, case-specific criteria for judging the LLM's responses. They then used this environment to train a new 32-billion-parameter LLM called Baichuan-M2 with reinforcement learning, using an improved version of the Group Relative Policy Optimization (GRPO) algorithm to help it learn effectively. This training made the model much better at handling complex, multi-turn medical scenarios; a sketch of how such a loop might fit together appears below.
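The paper's actual implementation is not reproduced in this summary, so the following is only a minimal, hypothetical sketch of the interactive verifier loop it describes: a simulated patient answers the doctor-model's questions from a medical record, a rubrics generator turns each case into checkable criteria, and the rubric score becomes the reinforcement-learning reward. All names here (PatientSimulator, RubricsGenerator, PatientCase, rollout, and the toy scoring logic) are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of a Patient Simulator + Clinical Rubrics Generator
# reward loop, as described at a high level in the paper. Stub logic only.
from dataclasses import dataclass

@dataclass
class Rubric:
    criterion: str   # e.g. "asked about fever"
    weight: float    # contribution to the total score

@dataclass
class PatientCase:
    record: dict     # de-identified medical record (symptom -> finding)

class PatientSimulator:
    """Plays the patient: answers the doctor-model from the record."""
    def __init__(self, case: PatientCase):
        self.case = case

    def reply(self, doctor_utterance: str) -> str:
        # Stub: a real simulator would be an LLM conditioned on the record.
        for key, value in self.case.record.items():
            if key in doctor_utterance.lower():
                return f"Patient: my {key} is {value}."
        return "Patient: I'm not sure, doctor."

class RubricsGenerator:
    """Turns a case into weighted, checkable evaluation criteria."""
    def generate(self, case: PatientCase) -> list[Rubric]:
        # Stub: the paper generates multi-dimensional rubrics dynamically.
        return [Rubric(f"asked about {key}", 1.0) for key in case.record]

def score_dialogue(transcript: list[str], rubrics: list[Rubric]) -> float:
    """Reward = weighted fraction of rubric criteria satisfied."""
    total = sum(r.weight for r in rubrics)
    hit = sum(r.weight for r in rubrics
              if any(r.criterion.split()[-1] in turn.lower()
                     for turn in transcript))
    return hit / total if total else 0.0

def rollout(policy, case: PatientCase, max_turns: int = 4) -> float:
    """One simulated consultation; returns the rubric-based RL reward."""
    sim = PatientSimulator(case)
    rubrics = RubricsGenerator().generate(case)
    transcript = []
    for _ in range(max_turns):
        question = policy(transcript)            # doctor-model turn
        transcript.append(question)
        transcript.append(sim.reply(question))   # simulated patient turn
    return score_dialogue(transcript, rubrics)

if __name__ == "__main__":
    case = PatientCase(record={"fever": "38.5 C", "cough": "dry, 3 days"})
    keys = list(case.record)
    # Toy policy that asks about each recorded symptom in turn.
    policy = lambda t: f"Do you have a {keys[len(t) // 2 % len(keys)]}?"
    print("reward:", rollout(policy, case))
```

In the paper's setup, rewards like this one feed a group-relative policy-gradient update rather than a per-turn supervised loss, which is what makes the dynamic, multi-turn environment trainable end to end.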

Why it matters?

This work is important because it shows that simply training LLMs on existing medical exams isn't enough to make them useful in real-world healthcare. By creating a dynamic testing and training environment, the researchers built a model that outperforms other open-source models and even some closed-source ones, setting a new standard for medical AI performance relative to model size. This means we can potentially get highly capable medical AI without needing extremely large and expensive models.

Abstract

As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as the USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verification, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a threshold previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
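The abstract mentions an improved GRPO but this summary does not spell out the modifications, so the equations below show only the standard GRPO objective (Shao et al., 2024) as background, slightly simplified to sequence level. For each prompt, a group of responses is sampled, each response's reward is normalized against the group, and the resulting advantage drives a PPO-style clipped update with a KL penalty against a reference policy:

```latex
% Standard GRPO (Shao et al., 2024), sequence-level form; Baichuan-M2's
% specific improvements are not reproduced here. For a prompt q, sample a
% group of G responses {o_1, ..., o_G} with rewards r_i, and compute
% group-relative advantages in place of a learned value baseline:
\[
  \hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}
                       {\operatorname{std}(\{r_1,\dots,r_G\})},
\]
% then maximize a clipped surrogate with a KL penalty to a reference policy:
\[
  \mathcal{J}(\theta) \;=\; \mathbb{E}\Bigl[\tfrac{1}{G}\sum_{i=1}^{G}
    \min\bigl(\rho_i \hat{A}_i,\;
      \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\bigr)\Bigr]
  \;-\;\beta\, D_{\mathrm{KL}}\!\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr),
  \qquad
  \rho_i \;=\; \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.
\]
```

Because the advantage is computed within each sampled group, GRPO needs no separate value network, which keeps the RL stage comparatively cheap; that fits this paper's emphasis on strong medical performance at a modest 32B parameter count.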