Boosting Medical Visual Understanding From Multi-Granular Language Learning

Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan

2025-11-21

Summary

This paper introduces a new method, Multi-Granular Language Learning (MGLL), to improve how computers understand images and related text, especially in complex fields like medical imaging.

What's the problem?

Current image-text models such as CLIP are good at matching an image to a single, specific label. Medical images, however, often carry multiple diagnoses and descriptions at different levels of detail: a scan might be annotated with a disease category, its severity, and a longer clinical explanation. Existing models struggle with this complexity because they are not designed to handle multiple labels per image or text descriptions at varying granularities.

What's the solution?

The researchers developed MGLL, a framework that aligns images with multiple labels and with descriptions at several levels of detail. It uses structured multi-label information, combines text descriptions across granularities, and applies a technique called 'soft-label supervision' so an image can match several texts at once rather than just one. It also uses KL divergence, a measure of how much two probability distributions differ, to keep the model's predictions consistent across levels of detail without slowing training. MGLL works as a plug-and-play module that can be added to existing image-text models.
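To make the soft-label idea concrete, here is a minimal, hypothetical sketch of a soft-label contrastive loss: instead of CLIP's one-hot target (each image matches exactly one text), the target is a probability distribution over the text batch, so an image with two valid descriptions distributes its target mass across both. The function name, the normalization scheme, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def soft_label_contrastive_loss(image_emb, text_emb, soft_labels, temperature=0.07):
    """Cross-entropy between softmax image-text similarities and soft targets.

    soft_labels[i, j] > 0 means image i matches text j; rows are normalized
    into probability distributions, so multiple matches per image are allowed.
    """
    # L2-normalize embeddings, as in standard contrastive pretraining
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Temperature-scaled cosine similarities, softmaxed over the text batch
    logits = image_emb @ text_emb.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # Soft targets: e.g. [1, 1, 0] becomes [0.5, 0.5, 0.0]
    targets = soft_labels / soft_labels.sum(axis=1, keepdims=True)

    # Mean cross-entropy over the image batch
    return -(targets * np.log(probs + 1e-12)).sum(axis=1).mean()
```

With a one-hot `soft_labels` matrix this reduces to the usual CLIP image-to-text loss; the soft version simply generalizes the target distribution to multi-label settings.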

Why it matters?

This work is important because it improves the accuracy of image understanding in areas like medical imaging, where precise and detailed analysis is crucial. By handling multiple labels and levels of detail, MGLL can help doctors and researchers more effectively use computers to analyze medical scans and improve patient care.

Abstract

Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
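The cross-granularity consistency idea mentioned in the abstract can be sketched as follows: predictions made against coarse-grained texts and fine-grained texts are softened with a temperature and pushed to agree via KL divergence. This is an illustrative sketch under assumed names and a generic temperature-softening scheme; the paper's exact "smooth KL" formulation may differ.

```python
import numpy as np

def smoothed_kl(p_logits, q_logits, temperature=2.0, eps=1e-12):
    """KL(P || Q) between temperature-softened similarity distributions.

    p_logits and q_logits are per-sample similarity scores against two
    granularities of text (e.g. diagnostic descriptions vs. clinical
    explanations). A higher temperature smooths both distributions before
    comparing them, which softens the consistency constraint.
    """
    def softmax(x, t):
        z = x / t
        z -= z.max(axis=-1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(p_logits, temperature)
    q = softmax(q_logits, temperature)
    # Mean KL divergence over the batch
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean()
```

When the two granularities yield identical similarity scores the penalty is zero, and it grows as their predictions diverge, which is exactly the consistency pressure the abstract describes.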