BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
2025-01-14

Summary
This paper introduces BIOMEDICA, a new resource that helps AI understand medical images and text better. The researchers created a huge collection of medical pictures with descriptions from scientific papers and used it to train AI models that can understand and work with medical information more effectively.
What's the problem?
AI models that can understand both images and text (called vision-language models, or VLMs) are getting better, but they still struggle with medical content. This is because there aren't many large, publicly available collections of medical images and text for AI to learn from. The collections that do exist usually cover only narrow specialties, so AI can't learn about all the different areas of medicine.
What's the solution?
The researchers made BIOMEDICA, which is like a giant digital library of medical images and their descriptions. They collected over 24 million pairs of images and text from more than 6 million scientific articles. They also added extra information to help AI understand the images better. Using this huge collection, they trained new AI models called BMCA-CLIP. These models can look at medical images and understand them really well, even for types of images they weren't specifically trained on.
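As a concrete illustration of what "understanding images they weren't specifically trained on" means, below is a minimal sketch of zero-shot classification with a CLIP-style model, the general recipe BMCA-CLIP follows. It uses the open_clip library; the checkpoint identifier, image file, and label phrases are hypothetical placeholders for illustration, not names confirmed by the paper.

```python
# Minimal sketch of CLIP-style zero-shot classification.
# The checkpoint name "hf-hub:BIOMEDICA/bmca-clip" is a hypothetical placeholder.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:BIOMEDICA/bmca-clip"  # hypothetical checkpoint location
)
tokenizer = open_clip.get_tokenizer("hf-hub:BIOMEDICA/bmca-clip")
model.eval()

# Candidate labels are phrased as short captions, the usual CLIP recipe.
labels = ["a histopathology slide", "a chest X-ray", "a retinal fundus photograph"]
text = tokenizer([f"an image of {label}" for label in labels])
image = preprocess(Image.open("example.png")).unsqueeze(0)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize and compare the image to each label description by cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze().tolist())))
```

The model is never fine-tuned on these specific categories; it simply picks the text description that best matches the image, which is why a single model can cover skin conditions, eye problems, cell biology, and more.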
Why it matters?
This matters because it could make AI much better at helping doctors and researchers in many areas of medicine. The AI models trained with BIOMEDICA are really good at understanding all sorts of medical images, from skin conditions to eye problems to cell biology. They're even better than previous AI models and use less computer power. By making their work freely available, the researchers are helping other scientists improve medical AI even further. This could lead to better diagnoses, more efficient research, and ultimately, better healthcare for everyone.
Abstract
The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
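For readers who want a sense of what "pre-trained via streaming" means in practice, below is a minimal sketch of streaming access to a large image-caption dataset with the Hugging Face datasets library, so the full 27 TB archive never has to be downloaded locally. The Hub identifier and field names are assumptions for illustration, not details confirmed by the abstract.

```python
# Minimal sketch of streaming a large image-caption dataset.
# The repository id "BIOMEDICA/biomedica-dataset" and the field names
# "image" / "caption" are assumed for illustration only.
from datasets import load_dataset

ds = load_dataset("BIOMEDICA/biomedica-dataset", split="train", streaming=True)

# Records are fetched lazily over the network instead of being read from local disk.
for i, example in enumerate(ds):
    image = example["image"]      # assumed field: the extracted figure
    caption = example["caption"]  # assumed field: its caption from the article
    # ... feed (image, caption) into a CLIP-style contrastive training step ...
    if i == 3:
        break
```

Training directly from such an iterator is what lets a CLIP-style model be continually pre-trained on the full archive without local storage on the order of the dataset size.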