Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions

Jun Li, Che Liu, Wenjia Bai, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel

2025-03-06

Summary

This paper presents a new way to improve AI models for detecting and locating medical abnormalities in images by using detailed descriptions of medical knowledge.

What's the problem?

AI models struggle to understand complex medical terms and link them to visual features in medical images. This makes it hard for them to accurately detect and pinpoint abnormalities, especially when dealing with unseen or rare conditions.

What's the solution?

The researchers created a method that breaks down complicated medical concepts into simpler attributes and common visual patterns. By aligning these descriptions with the visual data, the AI becomes better at recognizing and locating abnormalities in medical images. They tested this approach on a small 0.23B-parameter model and found it performed as well as much larger 7B models, despite training on only about 1.5% of the data those models used.
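The core idea can be illustrated with a small sketch: rather than prompting a grounding model with a raw abnormality term, the term is expanded into simple visual attributes. The attribute dictionary and prompt template below are illustrative assumptions, not the paper's actual knowledge base or prompt format.

```python
# Hypothetical knowledge base mapping abnormality terms to simple visual
# attributes (illustrative only; not the paper's actual descriptions).
KNOWLEDGE = {
    "pneumothorax": [
        "absence of lung markings",
        "a visible pleural line",
        "increased lucency near the lung edge",
    ],
    "consolidation": [
        "a region of increased opacity",
        "ill-defined borders",
    ],
}

def decompose_prompt(term: str) -> str:
    """Build a grounding prompt from simple visual attributes of a term."""
    attrs = KNOWLEDGE.get(term.lower())
    if attrs is None:
        # Fall back to the raw term for abnormalities we have no entry for.
        return f"Locate {term}."
    return f"Locate {term}, which appears as " + ", ".join(attrs) + "."

print(decompose_prompt("pneumothorax"))
```

A prompt like this would then be passed to a grounding VLM (the paper uses Florence-2) in place of the bare abnormality name, giving the model concrete visual cues to align with image features.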

Why it matters?

This matters because it helps make AI more reliable for medical imaging tasks, improving its ability to assist doctors in diagnosing diseases. It also shows promise for detecting rare or new conditions, which could lead to faster and more accurate healthcare solutions.

Abstract

Visual Language Models (VLMs) have demonstrated impressive capabilities in visual grounding tasks. However, their effectiveness in the medical domain, particularly for abnormality detection and localization within medical images, remains underexplored. A major challenge is the complex and abstract nature of medical terminology, which makes it difficult to directly associate pathological anomaly terms with their corresponding visual features. In this work, we introduce a novel approach to enhance VLM performance in medical abnormality detection and localization by leveraging decomposed medical knowledge. Instead of directly prompting models to recognize specific abnormalities, we focus on breaking down medical concepts into fundamental attributes and common visual patterns. This strategy promotes a stronger alignment between textual descriptions and visual features, improving both the recognition and localization of abnormalities in medical images. We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs, despite being trained on only 1.5% of the data used for such models. Experimental results also demonstrate the effectiveness of our approach on both known and previously unseen abnormalities, suggesting its strong generalization capabilities.