GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He
2024-11-26

Summary
This paper introduces GMAI-VL and GMAI-VL-5.5M, a new large vision-language model and a comprehensive dataset designed to improve artificial intelligence in the medical field by combining visual and textual information.
What's the problem?
Despite advancements in AI, existing models struggle with medical tasks because they lack specialized medical knowledge. This makes it hard for them to accurately understand and analyze medical images and texts, which is crucial for effective diagnosis and treatment.
What's the solution?
To tackle this issue, the authors created GMAI-VL-5.5M, a large dataset built by converting hundreds of specialized medical datasets into image-text pairs (a rough sketch of this conversion is given below). This gives the model high-quality training data covering a wide range of medical tasks. They also developed GMAI-VL, a vision-language model trained with a progressive three-stage strategy that integrates visual and textual information, improving its performance on tasks such as visual question answering and medical image diagnosis.
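The paper does not spell out the exact conversion pipeline here, but the core idea of turning a specialized dataset annotation (for example, an image-classification label) into an instruction-style image-text pair can be sketched as follows. The field names, prompt templates, and output format in this snippet are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch: converting a specialized medical dataset record
# (here, an image-classification annotation) into a single-turn
# image-text pair. Field names and prompt wording are assumptions.
import json
from typing import Dict, List

# Hypothetical prompt templates for a diagnosis-style QA pair.
QUESTION_TEMPLATE = "What abnormality is shown in this {modality} image?"
ANSWER_TEMPLATE = "The image shows findings consistent with {label}."

def record_to_image_text_pair(record: Dict[str, str]) -> Dict[str, object]:
    """Turn one classification record into an image-text pair."""
    question = QUESTION_TEMPLATE.format(modality=record["modality"])
    answer = ANSWER_TEMPLATE.format(label=record["label"])
    return {
        "image": record["image_path"],
        "conversations": [
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }

if __name__ == "__main__":
    # Toy annotations standing in for one specialized source dataset.
    source_records: List[Dict[str, str]] = [
        {"image_path": "cxr_0001.png", "modality": "chest X-ray", "label": "pneumonia"},
        {"image_path": "derm_0042.jpg", "modality": "dermoscopy", "label": "melanoma"},
    ]
    pairs = [record_to_image_text_pair(r) for r in source_records]
    print(json.dumps(pairs, indent=2))
```

Applied across many source datasets and modalities, this kind of transformation yields the uniform image-text format a vision-language model can be trained on.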
Why it matters?
This research is important because it enhances the ability of AI systems to process complex medical information, which can lead to better decision-making in healthcare. By providing a robust dataset and an advanced model, the authors aim to advance the field of general medical AI, making it more reliable for real-world applications in medicine.
Abstract
Despite significant advancements in general artificial intelligence, such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained by the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. This dataset features comprehensive task coverage, diverse modalities, and high-quality image-text data. Building upon this multimodal dataset, we propose GMAI-VL, a general medical vision-language model with a progressive three-stage training strategy. This approach effectively integrates visual and textual information, enhancing the model's ability to process multimodal data and to support accurate diagnosis and clinical decision-making. Experimental evaluations demonstrate that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the development of the GMAI-VL-5.5M dataset, the introduction of the GMAI-VL model, and the establishment of new benchmarks in multiple medical domains. Code and dataset will be released at https://github.com/uni-medical/GMAI-VL.
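The abstract mentions a progressive three-stage training strategy but does not describe the stages here. A common pattern for such strategies is to progressively unfreeze components of the model (projector first, then the vision encoder, then the full model during instruction tuning); the sketch below shows only that generic freeze/unfreeze pattern, and the specific stage definitions, module names, and schedule are assumptions rather than GMAI-VL's actual configuration.

```python
# Illustrative sketch of a progressive three-stage training schedule for a
# vision-language model (vision encoder + projector + language model).
# The stage definitions and freeze/unfreeze choices are assumptions.
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vis_dim: int = 64, txt_dim: int = 64):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, txt_dim)  # stand-in for a ViT
        self.projector = nn.Linear(txt_dim, txt_dim)       # vision-to-text bridge
        self.language_model = nn.Linear(txt_dim, txt_dim)  # stand-in for an LLM

    def forward(self, image_feats):
        return self.language_model(self.projector(self.vision_encoder(image_feats)))

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: ToyVLM, stage: int) -> None:
    """Assumed schedule: stage 1 trains the projector only (alignment),
    stage 2 also unfreezes the vision encoder, and stage 3 trains the
    full model (e.g., during instruction tuning)."""
    set_trainable(model.projector, True)
    set_trainable(model.vision_encoder, stage >= 2)
    set_trainable(model.language_model, stage >= 3)

if __name__ == "__main__":
    model = ToyVLM()
    for stage in (1, 2, 3):
        configure_stage(model, stage)
        trainable = [n for n, p in model.named_parameters() if p.requires_grad]
        print(f"stage {stage}: trainable parameters = {trainable}")
```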