Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
Jiawei Mao, Yuhan Wang, Lifeng Chen, Can Zhao, Yucheng Tang, Dong Yang, Liangqiong Qu, Daguang Xu, Yuyin Zhou
2025-10-08
Summary
This paper introduces MeDiM, a new artificial intelligence model designed to understand and generate medical information from different sources like images (X-rays, scans), pathology reports, and clinical notes, all at the same time.
What's the problem?
Currently, AI models in medicine are often built to work with only one type of data. For example, one model might analyze X-rays, while another reads text reports. This separation prevents them from learning the full picture because important clues often come from combining information across these different sources. It’s like trying to solve a puzzle with only half the pieces – it limits how well these models can truly understand and help with medical diagnosis and treatment.
What's the solution?
The researchers created MeDiM, which uses a technique called ‘discrete diffusion’ to learn a common language between images and text. It’s built on a powerful language model that already has a lot of medical knowledge. They made two key changes to this language model: they allowed it to look at information in both directions (instead of just forward), and they told it how far along the noising-and-denoising (‘diffusion’) process each input is, which helps it generate realistic images and text. This allows MeDiM to translate between images and text, create images from text descriptions, and even generate both an image and a report at the same time.
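The core idea, representing images and text as tokens in one shared discrete space and gradually masking them out, can be sketched in a few lines. This is an illustrative absorbing-state (masking) formulation of discrete diffusion, not the paper's actual code; the token ids, the `MASK` sentinel, and the linear noise schedule are assumptions for the example.

```python
import random

MASK = -1  # stand-in for a special [MASK] token id (illustrative)

def corrupt(tokens, t, rng):
    """Forward (noising) step of absorbing-state discrete diffusion:
    each token is independently replaced by [MASK] with probability t,
    where t in [0, 1] is the continuous diffusion timestep.
    At t=0 the sequence is clean; at t=1 it is fully masked.
    The model is trained to reverse this, predicting the original
    tokens from a partially masked sequence."""
    return [MASK if rng.random() < t else tok for tok in tokens]

# A unified sequence: image tokens (e.g. from a discrete image tokenizer)
# and text tokens share one vocabulary, so a single model can denoise
# both jointly -- producing an image, a report, or both at once.
image_tokens = [101, 102, 103, 104]  # hypothetical image-patch codes
text_tokens = [7, 8, 9]              # hypothetical word-piece ids
sequence = image_tokens + text_tokens

rng = random.Random(0)
clean = corrupt(sequence, 0.0, rng)   # no noise at t=0
noisy = corrupt(sequence, 1.0, rng)   # fully masked at t=1
```

Because image and text tokens live in the same sequence, "masking everything and denoising" naturally covers all three tasks the summary lists: mask only the text to generate a report from an image, mask only the image to generate an image from text, or mask both to generate a paired image and report together.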
Why it matters?
MeDiM is important because it’s a step towards creating more versatile and powerful AI tools for medicine. By being able to integrate information from multiple sources, it can potentially lead to more accurate diagnoses, better treatment plans, and a deeper understanding of diseases. The improvements in generating both images and reports together suggest it can create more complete and reliable medical information, which could ultimately improve patient care.
Abstract
Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
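Design (1) in the abstract, removing the causal attention mask, can be pictured with plain boolean masks (an illustrative sketch, not the paper's implementation): an autoregressive language model only lets each position attend to earlier positions, while a diffusion denoiser predicts all masked tokens in parallel and therefore needs every position to see the full sequence.

```python
def causal_mask(n):
    """Standard autoregressive LM attention mask:
    position i may attend to position j only when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Mask after removing causality: every position attends to the
    whole sequence, so a denoiser can use context on both sides of
    each masked token."""
    return [[True] * n for _ in range(n)]

# For a 3-token sequence, the causal mask hides the upper triangle,
# while the bidirectional mask is all True.
print(causal_mask(3))        # [[True, False, False], [True, True, False], [True, True, True]]
print(bidirectional_mask(3)) # [[True, True, True], [True, True, True], [True, True, True]]
```

Design (2), injecting continuous timestep embeddings, complements this: the same backbone must behave differently at high noise (mostly masked input) than at low noise, so each forward pass is conditioned on how far along the diffusion process the input is.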