MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

Weixiang Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Chengzhi Shen, Min Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan

2026-03-30

Summary

This paper introduces a new way to test how well artificial intelligence, specifically vision-language models, can analyze medical images like brain scans and CT scans. It moves beyond simply showing the AI a single picture and instead challenges it to work with full 3D medical studies, just like a doctor would.

What's the problem?

Currently, testing these AI models in medical imaging is too simple. Researchers usually give the AI pre-selected 2D images, which requires a lot of human effort to prepare. This doesn't reflect how doctors actually work – they need to look through entire 3D scans, often with different types of images, to make a diagnosis. The existing methods don't test if the AI can truly navigate and understand complex medical data.

What's the solution?

The researchers built two things: MEDOPENCLAW, an auditable runtime that lets AI models interact with standard medical imaging software (such as 3D Slicer) the way a doctor would, and MEDFLOWBENCH, a large benchmark of real medical studies (multi-sequence brain MRIs and lung CT/PET scans) designed to test those abilities. They evaluated state-of-the-art models such as Gemini 3.1 Pro and GPT-5.4 on how well they could navigate the scans and use tools within the software. Surprisingly, giving the models access to professional tools actually *reduced* their performance, likely because they struggled to pinpoint exact locations within the 3D volumes.
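To make the idea of an "auditable runtime" concrete, here is a minimal toy sketch of an agent-viewer loop. Everything in it (the `Viewer` class, the `scroll` action, the brightest-slice task) is an illustrative assumption, not the paper's actual MEDOPENCLAW interface; the point is only that every action the agent takes is recorded in a log that can be reviewed afterwards.

```python
# Hypothetical sketch of an auditable agent-viewer loop. The API names below
# are assumptions for illustration, not the real MedOpenClaw interface.

class Viewer:
    """Minimal stand-in for a slice viewer over one 3D study."""
    def __init__(self, volume):
        self.volume = volume          # list of 2D slices (here: lists of ints)
        self.index = 0
        self.log = []                 # audit trail of every action taken

    def scroll(self, step):
        """Move through the volume, clamped to valid slice indices."""
        self.index = max(0, min(len(self.volume) - 1, self.index + step))
        self.log.append(("scroll", step, self.index))
        return self.volume[self.index]

    def current_slice(self):
        self.log.append(("view", self.index))
        return self.volume[self.index]

def find_brightest_slice(viewer):
    """Toy 'agent policy': sweep the whole study, track peak intensity."""
    best_index, best_value = viewer.index, max(viewer.current_slice())
    for _ in range(len(viewer.volume) - 1):
        s = viewer.scroll(+1)
        if max(s) > best_value:
            best_index, best_value = viewer.index, max(s)
    return best_index, viewer.log

# Example: a 4-slice "volume" where slice 2 holds the peak intensity.
volume = [[1, 2], [3, 4], [9, 5], [2, 2]]
idx, log = find_brightest_slice(Viewer(volume))
```

Because the agent can only act through the viewer, the resulting `log` doubles as a complete, replayable record of the episode, which is what makes this style of evaluation auditable.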

Why it matters?

This work is important because it provides a more realistic and challenging way to evaluate AI in medical imaging. By simulating a real clinical workflow, it helps researchers identify weaknesses in current AI models and develop better, more reliable systems that can actually assist doctors in making diagnoses. It also creates a standard benchmark for comparing different AI models and tracking progress in the field.

Abstract

Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.