MedDINOv3: How to adapt vision foundation models for medical image segmentation?
Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
2025-09-03
Summary
This paper introduces a new method, MedDINOv3, for improving how well computers can automatically identify and outline organs and tumors in medical scans like CT scans and MRIs.
What's the problem?
Most current segmentation programs are tailored to a specific scan type, or even to the hospital where the scans were taken, so they often fail in new settings. Powerful image recognition systems trained on everyday photos do exist, but they don't transfer well to medical images, which look very different and demand a high level of precision. As a result, these general-purpose systems often lag behind older methods designed specifically for medical imaging.
What's the solution?
The researchers took a powerful image recognition system called DINOv3 and adapted it for medical images. They improved the basic structure of the system to better handle medical scans and then retrained it using a massive collection of over 3.8 million CT scan slices. This retraining process helped the system learn to recognize important features in medical images, making it much more accurate at outlining organs and tumors.
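The paper's architectural change combines patch tokens taken from several depths of the ViT before dense prediction. A minimal sketch of that multi-scale token aggregation idea is below; the class name, layer choices, and fusion by projection-plus-concatenation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleTokenAggregator(nn.Module):
    """Toy sketch: fuse patch tokens tapped from several ViT depths
    into one feature map for dense prediction (hypothetical design)."""

    def __init__(self, embed_dim=384, num_levels=4):
        super().__init__()
        # One linear projection per tapped transformer block.
        self.projs = nn.ModuleList(
            nn.Linear(embed_dim, embed_dim) for _ in range(num_levels)
        )
        # Fuse the concatenated multi-depth features back to embed_dim.
        self.fuse = nn.Linear(embed_dim * num_levels, embed_dim)

    def forward(self, token_maps):
        # token_maps: list of (B, N, C) patch-token tensors, one per depth.
        projected = [p(t) for p, t in zip(self.projs, token_maps)]
        fused = torch.cat(projected, dim=-1)  # (B, N, C * num_levels)
        return self.fuse(fused)               # (B, N, C)

# Usage: four feature maps from intermediate blocks of a ViT-S (C=384),
# with a 16x16 grid of patch tokens (N=256) and batch size 2.
agg = MultiScaleTokenAggregator(embed_dim=384, num_levels=4)
feats = [torch.randn(2, 256, 384) for _ in range(4)]
out = agg(feats)
print(out.shape)  # torch.Size([2, 256, 384])
```

Aggregating several depths lets the segmentation head see both low-level texture (early blocks) and high-level semantics (late blocks), which matters for precise organ boundaries.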
Why it matters?
This work is important because it shows that a single, versatile computer program can be used to accurately outline structures in medical scans across different types of scans and potentially different hospitals. This could save doctors time, improve the accuracy of diagnoses, and help with planning treatments more effectively, ultimately leading to better patient care.
Abstract
Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperforms specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
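The "multi-stage DINOv3 recipe" in the abstract builds on DINO-style self-distillation: a student network is trained to match the output distribution of an exponential-moving-average (EMA) teacher. A minimal sketch of that core objective follows; the temperatures and momentum are hypothetical values, and the full recipe (centering, multi-crop, dense objectives) is omitted.

```python
import torch
import torch.nn.functional as F

def dino_distillation_loss(student_logits, teacher_logits,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the sharpened teacher distribution and the
    student distribution (simplified DINO-style loss, no centering)."""
    t = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    s = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights slowly track the student via an EMA."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# Usage with toy logits over K=8 prototypes for a batch of 4 views:
s_logits = torch.randn(4, 8)
t_logits = torch.randn(4, 8)
loss = dino_distillation_loss(s_logits, t_logits)
```

Because the teacher targets come from augmented views of the same unlabeled slice, this objective can be run directly on the 3.87M CT slices without any segmentation labels, which is what makes domain-adaptive pretraining at that scale feasible.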