Medal S: Spatio-Textual Prompt Model for Medical Segmentation
Pengcheng Shi, Jiawei Chen, Jiaqi Liu, Xinglin Zhang, Tao Chen, Lei Li
2025-11-20
Summary
This paper introduces Medal S, a new computer model designed to automatically identify and outline different structures within medical images like CT scans, MRIs, and ultrasounds. It's a big step forward in medical image analysis because it can understand both what a doctor *tells* it to look for (through text) and *where* to look for it (directly in the image), all at the image's original resolution.
What's the problem?
Existing methods for medical image segmentation often struggle with accuracy and speed. Some rely only on text descriptions, which can be vague and don't account for the specific location of things in the image. Others process images at lower resolutions to save time, losing important details. Also, analyzing many different structures (like all the organs in a scan) can be very slow and require a lot of computing power.
What's the solution?
Medal S addresses these problems by combining text and spatial information directly within the model. It aligns the text instructions with the image data in a way that preserves the original image resolution. It also uses a technique called 'parallel spatial prompting' to identify many structures at once rather than one at a time (see the sketch below). The researchers also improved how the model handles data imbalances and streamlined the model's processing steps to use memory and time more efficiently.
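To picture the parallel-prompting idea, here is a minimal PyTorch sketch. The toy network `PromptedRefiner`, the tensor shapes, and the batch-axis trick are illustrative assumptions for exposition, not Medal S's actual architecture or API:

```python
import torch
import torch.nn as nn

class PromptedRefiner(nn.Module):
    """Stand-in 3D module (hypothetical): takes an image plus one spatial
    prompt (a coarse mask) and returns a refined logit map for that class."""
    def __init__(self):
        super().__init__()
        # 1 image channel + 1 prompt channel in, 1 logit map out.
        self.net = nn.Conv3d(2, 1, kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # image, prompt: (B, 1, D, H, W)
        return self.net(torch.cat([image, prompt], dim=1))

num_classes = 24
model = PromptedRefiner().eval()
image = torch.randn(1, 1, 32, 64, 64)            # one volume
coarse = torch.rand(num_classes, 1, 32, 64, 64)  # per-class coarse masks

with torch.no_grad():
    # Sequential prompting: one forward pass per class.
    seq = torch.cat([model(image, coarse[c:c + 1]) for c in range(num_classes)])

    # Parallel prompting: replicate the image along the batch axis and
    # refine all class prompts in a single forward pass.
    par = model(image.expand(num_classes, -1, -1, -1, -1), coarse)

assert torch.allclose(seq, par, atol=1e-5)  # same masks, 1 pass instead of 24
```

Because each class is refined independently, batching the classes along the batch axis gives the same outputs as looping over them while keeping the hardware busy in a single fused pass, which is consistent with the large inference-time reduction the paper reports for 24-class segmentation.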
Why does it matter?
This research is important because it makes medical image analysis faster and more accurate. This could help doctors diagnose diseases earlier and more reliably, leading to better patient care. The model's ability to handle a wide range of image types and structures makes it a versatile tool for many different medical applications, and making the code publicly available allows other researchers to build upon this work.
Abstract
We introduce Medal S, a medical segmentation foundation model that supports native-resolution spatial and textual prompts within an end-to-end trainable framework. Unlike text-only methods lacking spatial awareness, Medal S achieves channel-wise alignment between volumetric prompts and text embeddings, mitigating inaccuracies from resolution mismatches. By preserving full 3D context, it efficiently processes multiple native-resolution masks in parallel, enhancing multi-class segmentation performance. A lightweight 3D convolutional module enables precise voxel-space refinement guided by both prompt types, supporting up to 243 classes across CT, MRI, PET, ultrasound, and microscopy modalities in the BiomedSegFM dataset. Medal S offers two prompting modes: a text-only mode, where model predictions serve as spatial prompts for self-refinement without human input, and a hybrid mode, incorporating manual annotations for enhanced flexibility. For 24-class segmentation, parallel spatial prompting reduces inference time by more than 90% compared to sequential prompting. We propose dynamic resampling to address target-patch ratio imbalance, extending SAT and nnU-Net for data augmentation. Furthermore, we develop optimized text preprocessing, a two-stage inference strategy, and post-processing techniques to improve memory efficiency, precision, and inference speed. Averaged over the five modalities on the validation set, Medal S outperforms SAT with a DSC of 75.44 (vs. 69.83), NSD of 77.34 (vs. 71.06), F1 of 38.24 (vs. 24.88), and DSC TP of 65.46 (vs. 46.97). Medal S achieves excellent performance by harmonizing spatial precision with semantic textual guidance, demonstrating superior efficiency and accuracy in multi-class medical segmentation tasks compared to sequential prompt-based approaches. Medal S will be publicly available at https://github.com/yinghemedical/Medal-S.
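To make the dynamic-resampling idea concrete, below is a small sketch: given a target's bounding-box size in voxels and a fixed training patch size, pick an isotropic zoom factor that keeps the target at a workable fraction of the patch. The function name, the ratio bounds, and the isotropic scaling rule are illustrative choices, not the paper's exact recipe:

```python
import numpy as np

def dynamic_scale(target_size_vox, patch_size=(128, 128, 128),
                  min_ratio=0.1, max_ratio=0.8):
    """Return an isotropic zoom factor that keeps the target's largest extent
    within [min_ratio, max_ratio] of the corresponding patch dimension.
    (Hypothetical helper; bounds are illustrative, not the paper's values.)"""
    target_size_vox = np.asarray(target_size_vox, dtype=float)
    patch_size = np.asarray(patch_size, dtype=float)
    ratio = (target_size_vox / patch_size).max()  # current target-patch ratio
    if ratio < min_ratio:        # tiny target: zoom in so it fills the patch
        return min_ratio / ratio
    if ratio > max_ratio:        # oversized target: zoom out to fit the patch
        return max_ratio / ratio
    return 1.0                   # already well proportioned, keep native spacing

print(dynamic_scale((10, 12, 8)))      # small lesion  -> ~1.07 (upsample)
print(dynamic_scale((200, 180, 150)))  # large organ   -> ~0.51 (downsample)
```

A per-case factor like this could then feed a standard nnU-Net-style resampling and patch-sampling pipeline, which matches the abstract's framing of dynamic resampling as an extension of SAT and nnU-Net data augmentation.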