MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Zou, Huaxiu Yao
2024-10-18

Summary
This paper introduces MMed-RAG, a multimodal retrieval-augmented generation (RAG) system designed to improve the factual accuracy of medical AI models that combine visual data and language processing.
What's the problem?
AI models in healthcare often struggle to provide accurate diagnoses because they can generate incorrect information, a problem known as factual hallucination: the model produces answers that sound plausible but are not grounded in the actual data. In addition, existing ways of improving these models fall short; fine-tuning demands large amounts of high-quality data and degrades when deployment data differ from training data, while current retrieval-based methods do not generalize well across different medical fields.
What's the solution?
To tackle these issues, the authors developed MMed-RAG, which enhances the factual accuracy of medical AI models. It includes three main components: a domain-aware retrieval mechanism that routes each query to the right knowledge source for its medical field (such as radiology, ophthalmology, or pathology), an adaptive method for selecting only the retrieved contexts that are actually relevant rather than a fixed number of them, and a preference fine-tuning strategy that aligns the model's outputs with accurate medical knowledge. Together, these components make the retrieved evidence more reliable and significantly improve how the model understands and generates medical information; a rough illustration of the retrieval and selection steps is sketched below.
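To make the first two components concrete, here is a minimal sketch of what domain-aware retrieval with adaptive context selection could look like. This is not the authors' implementation: the embedding handling, domain names, and the similarity threshold are illustrative assumptions only.

```python
"""Illustrative sketch of domain-aware retrieval with adaptive context
selection. NOT the authors' code; domain names, scoring, and thresholds
are assumptions for demonstration only."""
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def classify_domain(image_emb: np.ndarray, domain_centroids: dict) -> str:
    """Route the query to the closest medical domain (e.g. radiology,
    ophthalmology, pathology) so retrieval uses the matching knowledge base."""
    return max(domain_centroids, key=lambda d: cosine(image_emb, domain_centroids[d]))


def retrieve(image_emb: np.ndarray, corpus: list) -> list:
    """Score every (text, embedding) entry of the domain-specific corpus,
    highest similarity first."""
    return sorted(((cosine(image_emb, emb), text) for text, emb in corpus), reverse=True)


def adaptive_select(ranked: list, max_k: int = 5, gap_ratio: float = 0.6) -> list:
    """Keep at most max_k contexts, dropping any whose similarity falls well
    below the best match (an adaptive cut-off instead of a fixed top-k)."""
    if not ranked:
        return []
    best_score = ranked[0][0]
    return [text for score, text in ranked[:max_k] if score >= gap_ratio * best_score]


# Toy usage with random embeddings standing in for a real image encoder.
rng = np.random.default_rng(0)
centroids = {"radiology": rng.normal(size=64), "ophthalmology": rng.normal(size=64)}
corpus_by_domain = {d: [(f"{d} report {i}", rng.normal(size=64)) for i in range(10)]
                    for d in centroids}
query_emb = rng.normal(size=64)
domain = classify_domain(query_emb, centroids)
contexts = adaptive_select(retrieve(query_emb, corpus_by_domain[domain]))
```

The selected contexts would then be appended to the model's prompt, so only evidence from the correct medical domain, and only as much of it as is genuinely similar to the query, influences the generated answer.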
Why it matters?
This research is important because it makes AI systems in healthcare more reliable and accurate, which is crucial for patient safety. By improving how these models process and generate medical information, MMed-RAG can help doctors make better diagnoses and treatment plans. This advancement could lead to better patient outcomes and more effective use of AI in medicine.
Abstract
Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches are not sufficiently general across different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection method, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs. Our data and code are available at https://github.com/richard-peng-xia/MMed-RAG.
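The abstract's "RAG-based preference fine-tuning strategy" refers to aligning the model with pairs of preferred and dispreferred responses constructed around retrieved contexts. As a hedged illustration only, the sketch below shows a standard DPO-style preference loss, which is one common way such alignment is implemented; whether MMed-RAG uses exactly this objective or how it builds its preference pairs is not specified here, and all names are illustrative.

```python
"""Sketch of a DPO-style preference loss, shown only to illustrate what
"preference fine-tuning" means in general; MMed-RAG's exact objective and
preference-pair construction may differ."""
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the 'chosen'
    (factually grounded) response over the 'rejected' one, measured
    relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Toy usage with fake summed log-probabilities for a batch of 4 preference pairs.
fake_logp = lambda: torch.randn(4)
loss = dpo_loss(fake_logp(), fake_logp(), fake_logp(), fake_logp())
```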