MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, Yuyin Zhou
2024-08-07

Summary
This paper introduces MedTrinity-25M, a large-scale multimodal medical dataset of over 25 million images spanning ten imaging modalities, annotated at multiple levels of detail for more than 65 diseases.
What's the problem?
In the medical field, having high-quality data is essential for training AI models that can assist in diagnosing diseases. However, existing datasets often lack detailed annotations and are limited by the availability of image-text pairs, making it difficult for AI to learn effectively from them.
What's the solution?
The authors developed MedTrinity-25M, which collects images across ten different imaging modalities and pairs them with extensive annotations. These annotations combine global information, such as the disease and imaging modality, with local details about regions of interest (ROIs) in each image. They built an automated pipeline that generates these annotations without needing paired text descriptions, which lets the dataset scale far beyond what existing image-text pairs allow. The resulting data supports a range of tasks, such as caption and report generation, as well as image classification and segmentation.
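To make the idea of multigranular annotations concrete, here is a minimal Python sketch of what a single annotation record could look like, combining image-level metadata, a global description, and ROI-level details. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ROIAnnotation:
    """Local annotation for a single region of interest (ROI) -- hypothetical schema."""
    bounding_box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    mask_path: Optional[str] = None           # optional path to a segmentation mask
    description: str = ""                     # region-specific textual description

@dataclass
class MultigranularRecord:
    """One image with global text plus local ROI annotations -- hypothetical schema."""
    image_path: str
    modality: str                             # e.g. "X-ray", "MRI", "histopathology"
    disease_labels: List[str] = field(default_factory=list)
    global_description: str = ""              # whole-image findings and inter-region relationships
    rois: List[ROIAnnotation] = field(default_factory=list)

# A made-up example record, purely for illustration:
record = MultigranularRecord(
    image_path="images/chest_xray_000123.png",
    modality="X-ray",
    disease_labels=["pneumonia"],
    global_description="Frontal chest radiograph; opacity in the right lower lobe.",
    rois=[ROIAnnotation(bounding_box=(410, 520, 690, 780),
                        description="Patchy consolidation consistent with pneumonia.")],
)
```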
Why it matters?
MedTrinity-25M is significant because it provides a robust resource for training medical AI models, helping improve their performance in real-world applications. By offering detailed and diverse data, this dataset can advance research in medical AI, leading to better diagnostic tools and ultimately improving patient care.
Abstract
This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, as well as detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Unlike existing approaches, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. Pretrained on MedTrinity-25M, our model achieves state-of-the-art performance on VQA-RAD and PathVQA, surpassing both multimodal large language models and other representative SoTA approaches. This dataset can also support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.
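As a rough illustration of how such an annotation pipeline could be wired together, the sketch below chains three stages: grounding ROIs with an expert model, retrieving related passages from a knowledge base, and prompting a multimodal LLM for a region-guided description. Every function here (ground_rois, retrieve_knowledge, describe_roi) is a hypothetical placeholder with toy stand-in logic, not the authors' implementation.

```python
# Hypothetical sketch of an image -> ROI -> description pipeline with toy stand-ins;
# none of these functions correspond to the authors' released code.
from typing import Dict, List

def ground_rois(image_path: str, modality: str) -> List[Dict]:
    """Stand-in for a domain-specific expert model (detector/segmenter)
    that proposes abnormal regions as bounding boxes."""
    return [{"bbox": (410, 520, 690, 780), "label": "suspected lesion"}]

def retrieve_knowledge(query: str, top_k: int = 2) -> List[str]:
    """Stand-in for a medical knowledge-base lookup used for
    retrieval-augmented generation."""
    toy_kb = {"suspected lesion": ["Focal opacities may indicate consolidation or a mass."]}
    return toy_kb.get(query, [])[:top_k]

def describe_roi(image_path: str, roi: Dict, context: List[str]) -> str:
    """Stand-in for prompting a multimodal LLM with the image, the ROI
    coordinates, and the retrieved context."""
    return f"Region {roi['bbox']}: {roi['label']}; context: {' '.join(context)}"

def build_triplets(image_path: str, modality: str) -> List[Dict]:
    """Produce image-ROI-description triplets for one unpaired image."""
    triplets = []
    for roi in ground_rois(image_path, modality):
        context = retrieve_knowledge(roi["label"])
        desc = describe_roi(image_path, roi, context)
        triplets.append({"image": image_path, "roi": roi, "description": desc})
    return triplets

print(build_triplets("images/chest_xray_000123.png", "X-ray"))
```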