EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang

2025-09-19

Summary

This paper introduces a new artificial intelligence model, EchoVLM, designed to help doctors interpret ultrasound images more accurately and efficiently.

What's the problem?

Currently, ultrasound diagnosis relies heavily on a doctor's experience, which can lead to inconsistencies and errors because it's subjective. While AI models called vision-language models show promise in helping with this, existing general-purpose models aren't very good at understanding the specifics of ultrasound images, especially when looking for problems across different organs or handling multiple diagnostic tasks at once. They lack specialized knowledge in this area.

What's the solution?

The researchers created EchoVLM, a vision-language model trained specifically on a large collection of ultrasound images from seven different parts of the body. It uses a design called a 'Mixture of Experts', which allows it to handle various tasks such as writing ultrasound reports, making diagnoses, and answering questions about the images. Essentially, it's like having multiple specialized AI systems working together.
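To make the 'Mixture of Experts' idea concrete, here is a minimal sketch of how such a layer typically works: a small gating network scores each input, the top-k experts are selected, and their outputs are combined by gate weight. This is a generic illustration in NumPy, not the paper's actual implementation; the layer sizes, number of experts, and top-k value are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Toy Mixture-of-Experts layer (illustration only): a gating network
    routes each input to its top-k experts, whose outputs are mixed."""

    def __init__(self, dim, num_experts=7, top_k=2):
        # One expert per "region" is a simplification for illustration;
        # 7 echoes the seven anatomical regions mentioned in the paper.
        self.top_k = top_k
        self.gate_w = rng.normal(size=(dim, num_experts))       # gating network
        self.experts = [rng.normal(size=(dim, dim)) * 0.1        # one linear
                        for _ in range(num_experts)]             # expert each

    def __call__(self, x):
        logits = x @ self.gate_w                                 # (batch, experts)
        top_idx = np.argsort(logits, axis=-1)[:, -self.top_k:]   # top-k expert ids
        out = np.zeros_like(x)
        for b in range(x.shape[0]):
            gates = softmax(logits[b, top_idx[b]])               # renormalize over
            for g, e in zip(gates, top_idx[b]):                  # the chosen experts
                out[b] += g * (x[b] @ self.experts[e])
        return out

moe = MoELayer(dim=16)
y = moe(rng.normal(size=(4, 16)))
print(y.shape)  # (4, 16): output keeps the input shape
```

The key property is sparsity: only the selected experts run per input, so capacity grows with the number of experts while per-example compute stays roughly constant.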

Why it matters?

EchoVLM significantly outperformed other AI models in tasks like generating ultrasound reports, meaning it can help doctors write them faster and more accurately. This is important because it has the potential to improve the accuracy of ultrasound diagnoses, leading to better patient care and making ultrasound a more reliable tool for early cancer screening.

Abstract

Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
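The BLEU-1 improvement cited above measures unigram (single-word) overlap between a generated report and a reference report. The sketch below shows a simplified BLEU-1: clipped unigram precision times a brevity penalty. It is a pedagogical approximation, not the official evaluation script, and the example sentences are invented.

```python
import math
from collections import Counter

def bleu_1(candidate: str, reference: str) -> float:
    """Simplified sentence-level BLEU-1: clipped unigram precision
    times a brevity penalty (illustration, not the official metric)."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Each candidate word counts only up to its frequency in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / max(len(cand), 1)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

score = bleu_1("thyroid nodule with clear margin",
               "thyroid nodule with a clear margin")
print(round(score, 3))  # 0.819
```

Reported BLEU scores are typically percentages, so a "10.15 point" gain corresponds to a 0.1015 increase on this 0-to-1 scale.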