RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, Huaxiu Yao

2024-07-08

Summary

This paper introduces RULE, a framework designed to improve the factual accuracy of Medical Large Vision Language Models (Med-LVLMs), which support medical diagnosis by generating text from medical images and related questions.

What's the problem?

The main problem is that current Med-LVLMs often produce responses that are not factually correct, which can be dangerous in a medical context. Retrieval-Augmented Generation (RAG) can help by pulling in external knowledge, but it creates two new problems. Retrieving too few contexts may miss necessary information, while retrieving too many introduces irrelevant or inaccurate references that interfere with generation. Moreover, when the model would have answered correctly on its own, leaning on retrieved information can push it into mistakes.

What's the solution?

To solve these issues, the authors developed RULE, which has two key components. First, it uses a provably effective strategy that calibrates how many contexts to retrieve, so that only an amount of external information consistent with a target factuality risk is included. Second, the authors curate a preference dataset from cases where over-reliance on retrieved contexts caused errors, and fine-tune the model on it so that it learns when to rely on its own knowledge versus the retrieved evidence. Tested on three medical visual question-answering (VQA) datasets, RULE improved factual accuracy by an average of 20.8%.
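
To make the first component concrete, here is a minimal sketch of calibrated selection of the number of retrieved contexts. This is an illustration only, not the authors' implementation: the functions `retrieve` and `answer_with_contexts` are hypothetical stand-ins, and the paper's actual strategy comes with a provable risk guarantee rather than this simple empirical search.

```python
def calibrate_k(calibration_set, retrieve, answer_with_contexts,
                max_k=10, risk_budget=0.2):
    """Pick the largest context count k whose empirical error rate on a
    held-out calibration set stays within the factuality risk budget.
    All callables are hypothetical stand-ins, not the authors' API."""
    best_k = 0
    for k in range(max_k + 1):
        errors = 0
        for question, image, gold in calibration_set:
            contexts = retrieve(question, image, top_k=k)
            prediction = answer_with_contexts(question, image, contexts)
            errors += int(prediction != gold)
        risk = errors / len(calibration_set)
        if risk <= risk_budget:
            best_k = k  # largest k seen so far that still meets the budget
    return best_k
```

The intuition matches the paper's first challenge: too few contexts miss information, too many add noise, so the number of contexts is tuned on held-out data rather than fixed blindly.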

Why it matters?

This research is important because it enhances the reliability of AI systems used in healthcare. By pushing these models toward accurate and trustworthy outputs, RULE can help improve medical diagnosis and patient care, making AI tools safer and more effective for doctors and patients alike.

Abstract

The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, limited retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references, interfering with the model's generation. Second, in cases where the model originally responds correctly, applying RAG can lead to an over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RULE on three medical VQA datasets, achieving an average improvement of 20.8% in factual accuracy. We publicly release our benchmark and code at https://github.com/richard-peng-xia/RULE.
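
As a companion to the abstract's second component, the following hypothetical sketch shows how preference pairs might be curated from over-reliance failures. Names such as `answer`, `answer_with_contexts`, and `retrieve` are assumptions for illustration; the paper's actual curation pipeline and fine-tuning objective may differ.

```python
def build_preference_pairs(dataset, retrieve, answer, answer_with_contexts, k):
    """Collect (chosen, rejected) pairs from cases where adding retrieved
    contexts flipped a correct answer to a wrong one, for preference
    fine-tuning (e.g., DPO). All callables are hypothetical stand-ins."""
    pairs = []
    for question, image, gold in dataset:
        own = answer(question, image)                  # model alone, no retrieval
        contexts = retrieve(question, image, top_k=k)
        rag = answer_with_contexts(question, image, contexts)
        if own == gold and rag != gold:                # over-reliance failure
            pairs.append({
                "prompt": (question, image, contexts),
                "chosen": own,       # prefer the model's own correct answer
                "rejected": rag,     # penalize the retrieval-induced error
            })
    return pairs
```

Fine-tuning on such pairs teaches the model to weigh its inherent knowledge against retrieved contexts, rather than deferring to retrieval by default.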