SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu

2025-10-06

Summary

This paper introduces a new resource called SpineMed, designed to help artificial intelligence better diagnose and understand spine disorders using medical images like X-rays and MRIs.

What's the problem?

Diagnosing spine problems is complex because doctors need to look at different types of scans (X-ray, CT, MRI) and pinpoint the exact vertebral level where the issue is. AI currently struggles with this because there aren't enough good datasets designed to teach models to understand these scans at each individual vertebral level and to reason about them the way a doctor would. Existing data lacks traceable explanations, and there is no standardized, spine-specific way to test AI performance.

What's the solution?

The researchers created SpineMed, which includes SpineMed-450k, a dataset of over 450,000 instruction examples (question-answering, multi-turn consultations, and report generation) about spine images across different vertebral levels. They built it by combining information from textbooks, medical guidelines, existing open datasets, and about 1,000 de-identified real patient cases, using a two-stage generation process (a draft pass followed by a revision pass) with doctors reviewing the results to ensure accuracy. They also created SpineBench, a benchmark that rigorously tests how well AI models can identify spinal levels, assess pathology, and even help with surgical planning. Finally, they fine-tuned an AI model on this dataset and showed that it performed much better than existing models at understanding detailed, level-specific spine issues.
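To make the clinician-in-the-loop, two-stage (draft and revision) pipeline concrete, here is a minimal sketch in Python. All function names, the stubbed model calls, and the acceptance check are hypothetical illustrations under the assumption that each generated item keeps a pointer back to its source excerpt; this is not the authors' code.

```python
# Hedged sketch of a two-stage "draft then revise" generation pipeline
# with a clinician-in-the-loop gate, loosely following the SpineMed-450k
# description. Stub functions stand in for real LLM and reviewer calls.

def draft_qa(source_text: str) -> dict:
    """Stage 1 (hypothetical): draft a QA pair grounded in a source excerpt."""
    return {
        "question": "Which vertebral level does this excerpt discuss?",
        "answer": "l4-l5",
        "source": source_text,  # keep provenance so the item stays traceable
    }

def revise_qa(qa: dict) -> dict:
    """Stage 2 (hypothetical): a second pass cleans up the draft for
    clarity and fidelity to the cited source."""
    revised = dict(qa)
    revised["answer"] = revised["answer"].strip().upper()
    revised["revised"] = True
    return revised

def clinician_accepts(qa: dict) -> bool:
    """Clinician-in-the-loop gate (hypothetical criterion): only items
    that retain a source citation survive review."""
    return bool(qa.get("source"))

def build_corpus(sources: list[str]) -> list[dict]:
    """Run every source excerpt through draft -> revise -> review."""
    corpus = []
    for text in sources:
        item = revise_qa(draft_qa(text))
        if clinician_accepts(item):
            corpus.append(item)
    return corpus

if __name__ == "__main__":
    items = build_corpus(["MRI report: disc herniation at L4-L5 ..."])
    print(len(items), items[0]["answer"])  # 1 L4-L5
```

The design point the paper emphasizes is traceability: each instruction instance carries its source, so the revision stage and the clinician review both have something concrete to check the draft against.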

Why it matters?

This work is important because spine disorders are incredibly common and disabling. By providing a better dataset and testing framework, this research helps move AI closer to being a useful tool for doctors, potentially leading to faster and more accurate diagnoses, and ultimately better treatment for patients with spine problems.

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.