
UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li

2025-10-22


Summary

This paper introduces a new system, UniMedVL, designed to handle various medical tasks using both images and text at the same time. It aims to bridge the gap between AI that *understands* medical images and AI that *creates* medical images, allowing for a more complete diagnostic process.

What's the problem?

Currently, medical AI is split into two main areas: understanding images (like identifying a tumor) and generating images (like creating a synthetic X-ray). Models that do one can't usually do the other. This separation causes problems because real medical diagnosis requires both – interpreting what you see *and* being able to visualize potential issues or outcomes. It limits how well AI can represent complex medical data, combine different types of information, and perform multiple tasks effectively.

What's the solution?

The researchers created a framework inspired by how doctors actually work – observing, gathering knowledge, and then analyzing. First, they built a large dataset called UniMed-5M with over 5.6 million samples, reformatting diverse single-modality medical data into paired multimodal examples. Then, they used a technique called Progressive Curriculum Learning to gradually introduce medical multimodal knowledge to the AI, from simpler to more complex tasks. Finally, they developed UniMedVL, a single AI model that can both understand medical images *and* generate them, all within the same architecture. This model achieves strong results on five medical image understanding benchmarks while matching specialized models in generation quality across eight medical imaging modalities.
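The staged-training idea behind Progressive Curriculum Learning can be sketched as follows. This is a minimal illustrative sketch only: the stage names mirror the paper's Observation-Knowledge-Analysis levels, but the specific task lists, epoch counts, and the `train_step` callback are hypothetical assumptions, not the paper's actual training schedule.

```python
# Hypothetical sketch of Progressive Curriculum Learning: training data is
# introduced in ordered stages of increasing multimodal complexity, rather
# than mixing everything from the start. Stage names follow the paper's
# OKA paradigm; the tasks and epoch counts here are illustrative guesses.

STAGES = [
    {"name": "observation", "tasks": ["image-caption pairs"], "epochs": 1},
    {"name": "knowledge", "tasks": ["medical VQA", "report generation"], "epochs": 2},
    {"name": "analysis", "tasks": ["image generation", "segmentation"], "epochs": 2},
]


def run_curriculum(stages, train_step):
    """Run each stage in order, invoking a training callback per (epoch, task).

    Returns a log of (stage_name, task) calls so the schedule can be inspected.
    """
    log = []
    for stage in stages:
        for _epoch in range(stage["epochs"]):
            for task in stage["tasks"]:
                train_step(stage["name"], task)  # placeholder for a real optimizer step
                log.append((stage["name"], task))
    return log


if __name__ == "__main__":
    schedule = run_curriculum(STAGES, lambda stage, task: None)
    print(len(schedule))  # total number of (stage, task) training calls
```

The point of the structure is simply that later stages never run before earlier ones finish, so harder generation and analysis tasks only appear once basic observation-level pairing has been trained.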

Why it matters?

This research is important because it shows that combining image understanding and generation into one AI system improves performance on a variety of medical tasks. By allowing the AI to learn from both analyzing and creating images, it gains a more comprehensive understanding of medical data, leading to better results in diagnosis and treatment planning. It’s a step towards more versatile and powerful medical AI tools.

Abstract

Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.