SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning
Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor
2025-06-30
Summary
This paper introduces SMMILE, the first expert-driven benchmark for evaluating how well multimodal large language models (MLLMs) learn from a few image-and-text examples (in-context learning) on medical tasks.
What's the problem?
Although MLLMs have made progress in answering medical questions about images, they struggle to learn new tasks from only a few examples, and their performance is highly sensitive to how relevant and well-ordered those example cases are.
What's the solution?
SMMILE was curated by medical experts, who selected problems and relevant example cases spanning a range of medical specialties and imaging modalities. The benchmark tests a model's ability to use these in-context examples when answering questions, revealing that current models typically show only small improvements from examples and are easily confused by irrelevant or poorly ordered ones.
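To make the evaluation setup concrete, here is a minimal sketch of how a multimodal in-context prompt of this kind can be assembled: a few expert-selected example cases (image, question, answer) are interleaved before the final query. All names and the prompt format below are illustrative assumptions, not SMMILE's actual API or template.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Case:
    image_ref: str   # placeholder for an image (e.g., a file path or image token)
    question: str
    answer: str

def build_icl_prompt(examples: List[Case],
                     query_image: str,
                     query_question: str) -> str:
    """Interleave few-shot example cases before the final query.

    Example order is preserved, since benchmarks like SMMILE report
    that models are sensitive to the ordering of in-context examples.
    """
    parts = []
    for i, ex in enumerate(examples, 1):
        parts.append(
            f"Example {i}:\n<image:{ex.image_ref}>\n"
            f"Q: {ex.question}\nA: {ex.answer}"
        )
    parts.append(
        f"Now answer:\n<image:{query_image}>\nQ: {query_question}\nA:"
    )
    return "\n\n".join(parts)

# Hypothetical example cases (not taken from the benchmark):
examples = [
    Case("cxr_001.png", "Is there a pleural effusion?", "Yes, on the left."),
    Case("cxr_002.png", "Is there a pleural effusion?", "No."),
]
prompt = build_icl_prompt(examples, "cxr_query.png",
                          "Is there a pleural effusion?")
```

In a real evaluation, the `<image:...>` placeholders would be replaced by the model's native image inputs; the key point is that the model must extract the task from the examples rather than from explicit instructions.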
Why it matters?
It highlights the limitations of current AI models for real clinical use and provides a rigorous tool to help researchers improve how AI learns from multimodal medical data, which is essential for trustworthy and effective clinical applications.
Abstract
Current multimodal large language models show moderate to poor performance in multimodal in-context learning for medical tasks, with sensitivity to example relevance and ordering.