M3Retrieve: Benchmarking Multimodal Retrieval for Medicine

Arkadeep Acharya, Akash Ghosh, Pradeepika Verma, Kitsuchart Pasupa, Sriparna Saha, Priti Singh

2025-10-09

Summary

This paper introduces a new benchmark called M3Retrieve designed to test how well artificial intelligence systems can find relevant medical information when that information is presented in both text *and* images, like a doctor's report with an X-ray.

What's the problem?

Currently, there isn't a standardized way to measure how good AI models are at searching for information in the medical field when dealing with both text and images. Existing benchmarks don't focus on this specific combination, making it hard to compare different models and track progress in building better medical AI tools. Medical information is often presented in multiple formats, so this is a significant limitation.

What's the solution?

The researchers created M3Retrieve, a large collection of over 1.2 million medical documents and 164,000 search requests that include both text and images. This benchmark covers five broad areas of medicine and sixteen specific medical specialties, and it tests models on four different types of search tasks. They then tested several existing AI models on this benchmark to see how they performed and identify areas where they struggle.
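At its core, a retrieval benchmark like this scores a model by how highly it ranks the relevant document for each query. A minimal sketch of one standard metric, Recall@k, assuming precomputed query and document embeddings (the function names and toy data here are illustrative, not from the paper):

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant_ids, k=10):
    """For each query, rank all documents by cosine similarity and
    check whether its relevant document appears in the top k."""
    # Normalize rows so dot products equal cosine similarities
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = q @ d.T                          # (num_queries, num_docs)
    topk = np.argsort(-scores, axis=1)[:, :k]  # indices of top-k docs
    hits = [rel in row for rel, row in zip(relevant_ids, topk)]
    return float(np.mean(hits))

# Toy example: 3 queries that are near-duplicates of docs 0, 2, and 4
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 4))
queries = docs[[0, 2, 4]] + 0.01 * rng.normal(size=(3, 4))
print(recall_at_k(queries, docs, relevant_ids=[0, 2, 4], k=1))  # → 1.0
```

In a multimodal setting, the query embedding would come from jointly encoding the text and image (for example, a report plus an X-ray), but the ranking and scoring step looks the same.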

Why it matters?

Developing AI that can accurately retrieve medical information from various sources is crucial for improving healthcare. This benchmark will help researchers systematically improve these AI systems, leading to better tools for doctors, more accurate diagnoses, and ultimately, better patient care. By providing a common standard for evaluation, it encourages innovation and faster progress in the field.

Abstract

With the increasing use of Retrieval-Augmented Generation (RAG), strong retrieval models have become more important than ever. In healthcare, multimodal retrieval models that combine information from both text and images offer major advantages for many downstream tasks such as question answering, cross-modal retrieval, and multimodal summarization, since medical data often includes both formats. However, there is currently no standard benchmark to evaluate how well these models perform in medical settings. To address this gap, we introduce M3Retrieve, a Multimodal Medical Retrieval Benchmark. M3Retrieve spans 5 domains, 16 medical fields, and 4 distinct tasks, with over 1.2 million text documents and 164K multimodal queries, all collected under approved licenses. We evaluate leading multimodal retrieval models on this benchmark to explore the challenges specific to different medical specialties and to understand their impact on retrieval performance. By releasing M3Retrieve, we aim to enable systematic evaluation, foster model innovation, and accelerate research toward building more capable and reliable multimodal retrieval systems for medical applications. The dataset and baseline code are available on GitHub: https://github.com/AkashGhosh/M3Retrieve.