ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu
2025-06-15
Summary
ReasonMed is a large-scale medical reasoning dataset with 370,000 high-quality examples. It was built by having multiple powerful language models work together to generate diverse, detailed reasoning paths for medical questions. These reasoning paths were then verified and refined to ensure they were accurate and medically sound. By pairing detailed thought steps with concise answer summaries, ReasonMed helps train medical question-answering models to be more accurate.
What's the problem?
Current large language models are very good at subjects like math and coding, but they struggle with medical questions that demand deep domain knowledge and careful reasoning. There was also no large, carefully verified medical reasoning dataset to teach these models how to think through complex medical problems correctly.
What's the solution?
The researchers built ReasonMed by having several advanced language models generate over a million candidate reasoning paths for medical questions. A rigorous verification and error-correction pipeline then selected the best 370,000 examples. They tested different training strategies and found that pairing detailed reasoning steps with concise summaries worked best. With this approach, their trained model, ReasonMed-7B, set new performance records, even beating some much larger models on medical question-answering benchmarks.
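To make the "detailed reasoning plus concise summary" idea concrete, here is a minimal sketch of what one such training example might look like. The field names and formatting are assumptions for illustration, not the paper's actual schema:

```python
# Hypothetical sketch of a "reasoning + summary" training example.
# Field names ("prompt", "completion") and the step formatting are
# assumptions, not taken from the ReasonMed paper itself.

def build_training_example(question, reasoning_steps, answer):
    """Combine a detailed chain of thought with a concise answer summary,
    the hybrid format the paper reports works best for training."""
    cot = "\n".join(
        f"Step {i}: {step}" for i, step in enumerate(reasoning_steps, start=1)
    )
    summary = f"Answer: {answer}"
    return {
        "prompt": question,
        "completion": f"{cot}\n\n{summary}",
    }

example = build_training_example(
    question="Which vitamin deficiency causes scurvy?",
    reasoning_steps=[
        "Scurvy presents with bleeding gums and poor wound healing.",
        "These symptoms result from impaired collagen synthesis.",
        "Collagen synthesis requires vitamin C as a cofactor.",
    ],
    answer="Vitamin C",
)
print(example["completion"])
```

The key design point the paper highlights is that neither the long reasoning trace nor the short summary alone is enough: the model learns from the step-by-step trace while the summary anchors it to a clear final answer.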
Why it matters?
ReasonMed is important because it provides a large, reliable, and high-quality set of medical reasoning examples that can train AI models to better understand and answer complex medical questions. This can improve the accuracy of medical AI tools, which could help doctors make better decisions and ultimately improve patient care.
Abstract
ReasonMed, a large medical reasoning dataset, enhances the accuracy of medical question answering models by combining detailed reasoning paths with concise summaries, setting new benchmarks for model performance.