ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models
M. Arda Aydın, Melih B. Yilmaz, Aykut Koç, Tolga Çukur
2026-03-19
Summary
This paper introduces ACE-LoRA, a new method for improving how well vision-language models work in the medical field. These models are good at understanding both images and text, but existing medical models either become very good at one specific type of medical image while failing on others, or handle many types reasonably well but lack the detail needed to be truly useful for diagnosis.
What's the problem?
Current medical vision-language models face a trade-off. You can train a specialist model that is really good at recognizing findings in, say, X-rays, but it will struggle with CT scans. Or you can train a generalist model that understands many types of medical images but misses the subtle details doctors need to make accurate diagnoses. Existing methods for adapting these models also fail to capture those important, fine-grained details.
What's the solution?
ACE-LoRA solves this by taking a generalist medical vision-language model and making small, targeted changes to it. It uses a technique called LoRA (Low-Rank Adaptation) to update the model efficiently without retraining everything. Crucially, it adds a new component called ACE-HGNN that models how different regions of an image relate to each other as groups, not just pairs, capturing the localized details that matter for diagnosis. It also refines how the model matches images to their descriptions, so that pairs describing the same condition are not mistakenly pushed apart. All of this adds only about 0.95M extra trainable parameters, keeping the approach efficient.
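To make the LoRA part concrete, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer, written in PyTorch. The class name, rank, and scaling factor are illustrative assumptions, not taken from the ACE-LoRA code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank correction."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # pretrained weights stay frozen
            p.requires_grad = False
        # Only the low-rank factors A (down) and B (up) are trained; B starts
        # at zero so the adapter initially leaves the base model unchanged.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank update x A^T B^T.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Each adapted layer contributes roughly rank × (in_features + out_features) trainable parameters, which is how LoRA-style methods keep budgets as small as the 0.95M parameters reported here.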
Why does it matter?
This work is important because it allows medical vision-language models to be both general enough to work across different types of medical images *and* specific enough to provide the detailed information doctors need. By improving performance on classification, segmentation, and detection tasks in medical images, ACE-LoRA has the potential to support more accurate and faster diagnoses.
Abstract
The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture domain-specific details but generalize poorly, and generalist medical VLMs trained on multi-domain data, which retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods that overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss to effectively suppress false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.
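The abstract does not spell out the ACE-HGNN architecture, but the following generic sketch, assuming a soft assignment of patch tokens to learned hyperedges, illustrates what "higher-order interactions beyond pairwise similarity" can look like in code: tokens are pooled into hyperedge features and scattered back, so each token is enriched with group-level context. Module names and the assignment scheme are illustrative, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftHypergraphContext(nn.Module):
    """Enrich patch tokens with group-level (hyperedge) context."""
    def __init__(self, dim: int, num_edges: int = 8):
        super().__init__()
        self.edge_proj = nn.Linear(dim, num_edges)  # node-to-hyperedge logits
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) patch tokens.
        assign = F.softmax(self.edge_proj(x), dim=1)         # (B, N, E) soft memberships
        edge_feat = torch.einsum('bne,bnd->bed', assign, x)  # pool nodes into hyperedges
        context = torch.einsum('bne,bed->bnd', assign, edge_feat)  # scatter back to nodes
        return x + self.out_proj(context)                    # residual enhancement
```

The label-guided InfoNCE idea is stated more directly: image-text pairs that share a label are semantically related and should not be treated as negatives. Below is a minimal sketch that masks such false negatives out of a symmetric contrastive loss, assuming one class label per pair; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def label_guided_info_nce(img_emb, txt_emb, labels, temperature=0.07):
    """img_emb, txt_emb: (N, D) embeddings; labels: (N,) integer class labels."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # (N, N) similarities
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=logits.device)
    # Off-diagonal pairs with the same label are semantically related;
    # masking them keeps correct alignments from being penalized as negatives.
    logits = logits.masked_fill(same_label & ~eye, float('-inf'))
    targets = torch.arange(len(labels), device=logits.device)
    # Symmetric image-to-text and text-to-image cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```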