Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng

2026-04-01

Summary

This paper examines why building powerful AI models for medical images is difficult. It catalogs more than 1,000 existing medical image datasets and proposes a way to combine them.

What's the problem?

Building strong AI models for medical images requires large amounts of data, but that data is hard to obtain. Medical images are privacy-sensitive, need clinical experts to label them correctly, and are not easily shared. As a result, few large, unified collections of medical images exist, which slows the development of advanced AI tools for doctors.

What's the solution?

The researchers created a detailed catalog of over 1,000 publicly available medical image datasets, recording each one's imaging modality, the anatomy it covers, the tasks it supports, and its annotations and limitations. They also propose a metadata-driven fusion paradigm (MDFP) that combines smaller datasets sharing a modality or task into larger, more coherent ones. Finally, they built an interactive web portal where anyone can find and combine these datasets.

Why does it matter?

This work offers a roadmap for building better AI for medical imaging. By making medical image data easier to find, share, and combine, it could speed up the development of more accurate and reliable AI tools that help doctors diagnose diseases and improve patient care.

Abstract

Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the availability of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
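
The abstract describes MDFP only at a high level: datasets that share a modality or task are integrated into larger resources. The Python sketch below illustrates that grouping idea and is not the authors' implementation. The `DatasetMeta` record, its field names, the toy catalog entries, and the `fusion_candidates` helper are all hypothetical, and a real pipeline would also need to reconcile label spaces, image formats, and licenses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical metadata record; the survey's actual schema may differ."""
    name: str
    modality: str
    task: str
    num_images: int


def group_by_shared_metadata(catalog, key="modality"):
    """Group entries whose `key` field matches after trimming and lowercasing."""
    groups = defaultdict(list)
    for entry in catalog:
        value = getattr(entry, key).strip().lower()
        groups[value].append(entry)
    return dict(groups)


def fusion_candidates(catalog, key="modality", min_datasets=2):
    """Keep only groups with at least `min_datasets` members and pool their sizes."""
    candidates = {}
    for value, members in group_by_shared_metadata(catalog, key).items():
        if len(members) >= min_datasets:
            candidates[value] = {
                "datasets": [m.name for m in members],
                "total_images": sum(m.num_images for m in members),
            }
    return candidates


# Toy catalog with made-up names and sizes, for illustration only.
catalog = [
    DatasetMeta("toy_ct_a", "CT", "segmentation", 120),
    DatasetMeta("toy_ct_b", "ct ", "classification", 300),
    DatasetMeta("toy_mri_a", "MRI", "segmentation", 80),
    DatasetMeta("toy_xray_a", "X-ray", "classification", 500),
]

by_modality = fusion_candidates(catalog, key="modality")
by_task = fusion_candidates(catalog, key="task")
```

In this toy catalog, grouping by modality pools the two CT datasets, while grouping by task pools datasets across modalities. That shows why the choice of shared metadata field determines what kind of larger resource results.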
