How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment
Zhen Chen, Qing Xu, Jinlin Wu, Biao Yang, Yuhao Zhai, Geng Guo, Jing Zhang, Yinlu Ding, Nassir Navab, Jiebo Luo
2025-11-04
Summary
This research investigates how well current AI video generation models, known as 'foundation models', can understand and realistically simulate surgical procedures. These models are good at creating videos that *look* real, but this paper asks whether they actually understand the *why* behind surgical actions.
What's the problem?
While AI is getting good at making realistic videos of the physical world, it hasn't been tested in specialized fields like surgery. Surgery isn't just about general physics; it requires a deep understanding of anatomy, how instruments interact with tissues, and the overall strategy a surgeon uses. The problem is that we don't know if these AI models can handle this level of complex, causal reasoning, or if they just create visually appealing but ultimately incorrect simulations.
What's the solution?
The researchers created a new benchmark called SurgVeo, an expert-curated collection of surgical videos used to test AI models. They also developed a 'Surgical Plausibility Pyramid' (SPP), a framework that evaluates generated videos on four levels: visual appearance, plausible instrument operation, realistic tissue and environment feedback, and sound surgical intent. They then tasked a powerful video generation model, Veo-3, with predicting what happens next in clips from laparoscopic and neurosurgical procedures, and had a panel of four board-certified surgeons judge the results using the Pyramid.
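The tiered evaluation can be pictured as a simple scoring scheme. The sketch below is purely illustrative: the four tier names come from the paper, but the rating scale, data structures, and averaging across the surgeon panel are assumptions for exposition, not the paper's actual protocol.

```python
from dataclasses import dataclass
from statistics import mean

# The four SPP tiers, ordered from basic appearance to surgical strategy.
# Tier names follow the paper; everything else here is a hypothetical sketch.
SPP_TIERS = [
    "visual_perceptual",     # does the clip look like real surgery?
    "instrument_operation",  # are the instruments used plausibly?
    "environment_feedback",  # does tissue respond realistically?
    "surgical_intent",       # does the action follow a logical plan?
]

@dataclass
class ClipRatings:
    """One surgeon's scores for one generated clip (assumed 1-5 scale)."""
    scores: dict  # tier name -> score

def aggregate(panel: list) -> dict:
    """Average each SPP tier's score across the surgeon panel."""
    return {t: mean(r.scores[t] for r in panel) for t in SPP_TIERS}
```

Under such a scheme, the "plausibility gap" the paper reports would show up as high aggregate scores on the lowest tier and sharply lower scores on the upper three.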
Why it matters?
The study found a significant gap between how realistic the AI-generated videos *appear* and how medically plausible they actually are. The AI could create videos that looked like surgery, but it often failed to show correct instrument use, to capture how tissue would realistically respond, or to follow a logical surgical plan. This shows that making things *look* real isn't enough for AI to be useful in high-stakes medical settings, and it provides a roadmap for future AI development in healthcare focused on true causal understanding rather than visual mimicry.
Abstract
Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.