HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
2024-07-01

Summary
This paper introduces HuatuoGPT-Vision, a system designed to improve how multimodal large language models (MLLMs) handle medical information by pairing visual knowledge with text. It focuses on strengthening these models' ability to understand and reason about medical images alongside written data.
What's the problem?
Despite advancements in MLLMs like GPT-4V, significant challenges remain in the medical field because of a shortage of high-quality data that combines medical images with text. Data privacy concerns and the high cost of annotating (labeling) data make it difficult to gather enough good training examples, and existing datasets often contain noise or irrelevant information that can confuse models and lead to poor performance.
What's the solution?
To address these challenges, the authors built the PubMedVision dataset: 1.3 million refined medical question-and-answer (VQA) samples that pair medical images from PubMed with relevant text. They improved data quality by having advanced MLLMs (GPT-4V) work in an 'unblinded' way, that is, with access to the images themselves, to denoise and reformat the original image-text pairs, as sketched below. They then used this dataset to train a new model called HuatuoGPT-Vision, which demonstrated significantly better performance on medical multimodal scenarios than other open-source models.
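To make the data-construction idea concrete, here is a minimal sketch of how an image-caption pair could be sent to a vision-capable model in an 'unblinded' way (image attached) so the model can denoise the caption and rewrite it as a VQA sample. This is not the authors' released pipeline; the prompt wording, model name, file paths, and helper function are illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' pipeline. Assumes the OpenAI
# Python SDK and a vision-capable chat model; the prompt is a placeholder.
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFORMAT_PROMPT = (
    "You are given a medical image and its original caption/context. "
    "Remove irrelevant or noisy text, then rewrite the content as one "
    "question-answer pair about the image. Reply as JSON with keys "
    "'question' and 'answer'."
)

def image_caption_to_vqa(image_path: str, caption: str) -> dict:
    """Send the image plus its caption to the model ('unblinded': the model
    sees the image) and parse the returned question-answer pair."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for any vision-capable MLLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{REFORMAT_PROMPT}\n\nCaption: {caption}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    # Hypothetical example input; real pipelines would loop over many pairs.
    sample = image_caption_to_vqa(
        "figure1.jpg",
        "Axial CT image showing a lesion in the left hepatic lobe.",
    )
    print(sample["question"], "->", sample["answer"])
```

In practice such a step would be run at scale over the refined PubMed image-text pairs, with additional filtering and validation; the sketch only illustrates the "image + noisy text in, clean VQA pair out" pattern described in the paper.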
Why it matters?
This research matters because it shows how AI systems can better understand and analyze medical information by effectively combining visual and textual data. By strengthening MLLMs for medicine, HuatuoGPT-Vision can support better tools for healthcare professionals, such as more accurate diagnostic assistants and educational resources, ultimately benefiting patient care and medical research.
Abstract
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.