MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation
Gurucharan Marthi Krishna Kumar, Aman Chadha, Janine Mendola, Amir Shmuel
2024-10-04

Summary
This paper presents MedVisionLlama, a method that improves medical image segmentation by integrating layers from pre-trained large language models (LLMs) into Vision Transformers (ViTs).
What's the problem?
Medical image segmentation is a critical task that involves identifying and outlining specific regions in medical images, such as tumors or organs. Traditional methods can struggle to segment these regions accurately, especially when the images contain complex structures or vary widely in appearance, which can lead to less accurate diagnoses and treatment plans.
What's the solution?
To solve this problem, the authors integrate pre-trained LLM transformer blocks into ViT models. By inserting a frozen LLM transformer block into the model's encoder, they create a hybrid model that combines the strengths of both LLMs and ViTs. They also introduce a Hybrid Attention Mechanism that lets the model learn both global and local features effectively. This approach yields significant improvements in segmentation performance, as measured by metrics such as the Dice score, accuracy, and precision.
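To make the core idea concrete, here is a minimal PyTorch sketch of a frozen pre-trained LLM transformer block inserted into a ViT-style encoder. This is not the authors' code; names such as FrozenLLMEncoderBlock, llm_block, vit_dim, and llm_dim are illustrative assumptions, and a generic transformer layer stands in for a real LLM block.

```python
import torch
import torch.nn as nn


class FrozenLLMEncoderBlock(nn.Module):
    """Wraps a frozen, pre-trained LLM transformer block inside a ViT encoder (sketch)."""

    def __init__(self, llm_block: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        # Project ViT patch tokens into the LLM's hidden size and back.
        self.proj_in = nn.Linear(vit_dim, llm_dim)
        self.proj_out = nn.Linear(llm_dim, vit_dim)
        self.llm_block = llm_block
        # Freeze the pre-trained LLM weights; only the projections are trained.
        for p in self.llm_block.parameters():
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, vit_dim)
        h = self.proj_in(tokens)
        h = self.llm_block(h)
        # Residual connection keeps the original visual features intact.
        return tokens + self.proj_out(h)


# Example usage with a stand-in transformer layer in place of a real LLM block.
llm_stub = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
block = FrozenLLMEncoderBlock(llm_stub, vit_dim=256, llm_dim=512)
patches = torch.randn(2, 196, 256)  # (batch, patches, embed_dim)
out = block(patches)                # same shape as the input
```

In this reading, the frozen block contributes the representational structure learned during language pre-training, while the small trainable projections adapt visual tokens to and from the LLM's hidden space.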
Why it matters?
This research is important because it shows how combining different types of AI models can lead to better results in medical imaging tasks. By enhancing the accuracy of medical image segmentation, this method can help healthcare professionals make more informed decisions, ultimately improving patient care and outcomes.
Abstract
Large Language Models (LLMs), known for their versatility in textual data, are increasingly being explored for their potential to enhance medical image segmentation, a crucial task for accurate diagnostic imaging. This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance across various medical imaging modalities. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index. These results demonstrate the effectiveness of LLM-based transformers in refining medical image segmentation, highlighting their potential to significantly boost model accuracy and robustness. The source code and our implementation are available at: https://bit.ly/3zf2CVs
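The abstract's Multi-Scale Fusion Block aggregates features across scales. The sketch below shows one common way such a block can be realized; it is an assumption about the general pattern, not the paper's implementation, and the class and parameter names (MultiScaleFusionBlock, in_channels, out_channels) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusionBlock(nn.Module):
    """Fuses feature maps from several encoder stages at a shared resolution (sketch)."""

    def __init__(self, in_channels: list, out_channels: int):
        super().__init__()
        # One 1x1 conv per scale to map each stage to a shared channel width.
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.fuse = nn.Conv2d(out_channels * len(in_channels), out_channels, kernel_size=1)

    def forward(self, features: list) -> torch.Tensor:
        # Upsample every stage to the resolution of the largest feature map.
        target = features[0].shape[-2:]
        resized = [
            F.interpolate(r(f), size=target, mode="bilinear", align_corners=False)
            for r, f in zip(self.reduce, features)
        ]
        # Concatenate along channels and merge with a 1x1 convolution.
        return self.fuse(torch.cat(resized, dim=1))


# Example usage with three stages at decreasing spatial resolution.
feats = [
    torch.randn(1, 64, 56, 56),
    torch.randn(1, 128, 28, 28),
    torch.randn(1, 256, 14, 14),
]
fusion = MultiScaleFusionBlock(in_channels=[64, 128, 256], out_channels=64)
print(fusion(feats).shape)  # torch.Size([1, 64, 56, 56])
```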