LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang
2025-10-06
Summary
This paper introduces LEAML, a method for helping large AI models that understand both images and text perform better on specialized tasks like analyzing medical images or sports videos.
What's the problem?
Current AI models are really good at general image tasks, but they struggle in specialized areas like medical imaging, where collecting enough labeled examples to train them is difficult and expensive. In these fields, models must learn from limited data, and they often perform poorly on images that differ from what they were originally trained on.
What's the solution?
LEAML tackles this by cleverly using both a small amount of labeled image-question-answer data and a larger amount of unlabeled images. It creates realistic questions and answers for the unlabeled images, essentially teaching the AI what to look for. The key is that it only updates the parts of the AI model that are most important for answering questions, making the learning process more efficient and focused on the specific domain.
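The paper does not spell out the mechanics here, but the core idea of updating only the most question-answering-relevant parts of the model can be sketched in a few lines. In this toy illustration (the function name, the relevance scores, and the plain gradient step are all hypothetical, not the authors' actual implementation), each row of a weight matrix stands in for one neuron, and only the top-k most relevant rows receive an update:

```python
import numpy as np

def selective_update(weights, grads, relevance, k, lr=0.1):
    """Apply a gradient step to only the k most relevant neurons (rows);
    all other neurons stay frozen. Purely illustrative, not LEAML's code."""
    top = np.argsort(relevance)[-k:]       # indices of the k highest-relevance neurons
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[top] = True
    updated = weights.copy()
    updated[mask] -= lr * grads[mask]      # update selected rows; freeze the rest
    return updated, mask

# Toy example: 5 "neurons" with 3 weights each; neurons 1 and 4 are deemed
# most relevant to question-answering (scores are made up for illustration).
w = np.ones((5, 3))
g = np.full((5, 3), 0.5)
rel = np.array([0.1, 0.9, 0.2, 0.3, 0.8])
w2, mask = selective_update(w, g, rel, k=2)
```

Because only the masked rows change, the rest of the model retains its general-purpose knowledge while the selected neurons absorb the domain-specific signal.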
Why it matters?
This research is important because it offers a way to adapt powerful AI models to specialized fields without needing massive amounts of labeled data. This is particularly useful in areas like healthcare, where getting labeled medical images is a major challenge, and could lead to better and more accessible diagnostic tools.
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question-answering, enabling the QA generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, highlighting the effectiveness of our proposed LEAML framework.