SpineBench: AI Spine Imaging Benchmarks
- SpineBench is a set of benchmarks for evaluating AI systems in spine imaging and clinical decision-making using multimodal data and fine-grained annotations.
- It employs rigorous data curation and validation protocols with metrics like accuracy, F1 score, and Dice similarity to ensure clinical relevance.
- Its three distinct benchmarks target clinical reasoning, pathology VQA, and lumbar segmentation, addressing limitations in spine disease diagnosis and surgical planning.
SpineBench refers to several prominent benchmarks for evaluating artificial intelligence systems in spine imaging and clinical decision-making. As of 2025, the term covers three distinct resources: (1) a clinically salient, level-aware multimodal benchmark grounded in the SpineMed-450k corpus (Zhao et al., 3 Oct 2025), (2) a large-scale visual question answering (VQA) framework for spinal pathology analysis (Zhang et al., 14 Oct 2025), and (3) the public lumbar spine segmentation challenge and dataset developed within the SPIDER project (Graaf et al., 2023). Each serves a different purpose and user community, ranging from diagnostic-level clinical reasoning to object-level vertebral and lesion segmentation, and each is underpinned by rigorous data curation, labeling, and evaluation protocols.
1. Benchmarks: Purposes and Clinical Motivation
Spinal disorders are a leading cause of disability globally, with over 600 million affected individuals (Zhao et al., 3 Oct 2025). Accurate diagnosis, localization, and management of these conditions require fine-grained, vertebral-level reasoning from cross-modality imaging (X-ray, CT, MRI). Conventional benchmarks either focus on segmentation (e.g., VerSe, SPARk) or generic visual question answering, but do not address the structured, level-aware, and multimodal reasoning intrinsic to real-world spine clinical workflows.
The SpineBench family was introduced to address these critical limitations:
- SpineMed-450k/SpineBench (Level-Aware Clinical Reasoning): Designed for vertebral-level, multimodal, and clinically grounded tasks, enabling the evaluation of models on axes directly relevant to surgical planning and diagnostic clarity (Zhao et al., 3 Oct 2025).
- SpineBench VQA (Pathology Diagnosis and Localization): Focused on spinal disease discrimination and lesion localization in large radiographic collections, with hard-negative distractors to reflect realistic diagnostic challenges (Zhang et al., 14 Oct 2025).
- SPIDER SpineBench (Segmentation): Emphasizes anatomical segmentation of vertebrae, intervertebral discs (IVDs), and spinal canals in lumbar MRI, providing per-level radiological gradings and a platform for algorithmic comparison (Graaf et al., 2023).
2. Datasets and Corpus Construction
SpineMed-450k and SpineBench (Clinical Reasoning)
The SpineMed-450k corpus comprises 456,748 instruction instances sourced as follows (Zhao et al., 3 Oct 2025):
- 377,000 questions and report prompts derived from canonical orthopedic textbooks and surgical guidelines;
- 61,000 items from Europe PMC case reports and question banks;
- 9,700 multimodal items curated from approximately 1,000 anonymized hospital imaging cases (X-ray, CT, MRI);
- Minor contributions from open challenges such as Spark and VerSe (~300 items).
A clinician-in-the-loop, two-stage LLM pipeline generates the data: a vision-LLM drafts candidate items, and a second LLM revision pass checks language quality and factual correctness and records “trace” logs of its changes. Manual review is triggered if the revision edits exceed a token threshold or if inter-rater agreement in sampled audits falls below the target (Cohen’s κ > 0.8).
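The review trigger described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the use of `difflib` to count edited tokens, and the default `max_changed_tokens=20` are all assumptions made for the example; only the κ target of 0.8 comes from the text.

```python
from collections import Counter
from difflib import SequenceMatcher


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def changed_tokens(draft, revised):
    """Count tokens touched by the revision (hypothetical edit measure)."""
    ops = SequenceMatcher(None, draft, revised).get_opcodes()
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in ops if tag != "equal")


def needs_manual_review(draft, revised, audit_a, audit_b,
                        max_changed_tokens=20, kappa_floor=0.8):
    """Flag an item if edits are large or audit agreement is too low.

    max_changed_tokens is an illustrative value; the source does not
    state the actual token threshold.
    """
    too_many_edits = changed_tokens(draft, revised) > max_changed_tokens
    low_agreement = cohens_kappa(audit_a, audit_b) < kappa_floor
    return too_many_edits or low_agreement
```

In this sketch the audit labels are passed per call for simplicity; in practice the agreement check would more plausibly run over a sampled batch rather than per item.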
SpineBench VQA (Diagnosis and Localization)
SpineBench for VQA integrates and standardizes image-label pairs from four public datasets (Zhang et al., 14 Oct 2025):
| Source Dataset | Imaging Modality | Key Annotation |
|---|---|---|
| BUU Spine Dataset | X-ray | Disease classification |
| CSXA | X-ray/MRI | High-resolution, diverse cohorts |
| RSNA Lumbar Classification | X-ray | Disc degeneration, stenosis |
| VinDr-SpineXR | X-ray | Detection, classification |
The final benchmark contains 64,878 QA pairs associated with 40,263 unique images, spanning 11 disease classes. Each image is labeled with a disease diagnosis; 24,615 are annotated for lesion localization. Hard negatives for multiple-choice diagnosis are generated by embedding all images through SigLIP2 and selecting visually similar but incorrect categories.
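One way to picture the hard-negative selection is as a nearest-neighbour search over image embeddings. The sketch below assumes the embeddings have already been computed (for instance by SigLIP2) and are available as plain lists of floats; the function names, the `bank` structure, and the rule of ranking each wrong class by its single most similar image are illustrative assumptions, not the published procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hard_negative_classes(query_vec, query_label, bank, k=3):
    """Return up to k wrong classes whose images look most like the query.

    bank: list of (embedding, label) pairs for the reference images.
    Each wrong class is scored by its most similar image to the query.
    """
    best = {}
    for vec, label in bank:
        if label == query_label:
            continue  # the true class can never be a distractor
        sim = cosine(query_vec, vec)
        if sim > best.get(label, -2.0):
            best[label] = sim
    ranked = sorted(best, key=lambda lab: best[lab], reverse=True)
    return ranked[:k]
```

The returned class names would then be shuffled together with the correct answer to form the multiple-choice options.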
SPIDER SpineBench (Segmentation)
The SPIDER dataset comprises 447 sagittal MRI series from 218 patients, with 39 studies reserved as a hidden test set. Ground-truth segmentations were created by iterative semi-automatic labeling: a small subset was annotated manually, a 3D U-Net was trained on it, and the model's predictions on further cases were corrected by expert annotators and fed back into training (Graaf et al., 2023).
3. Benchmark Tasks and Clinical Axes
SpineMed-450k/SpineBench (Clinical Reasoning Tasks)
Three clinically salient axes structure the evaluation (Zhao et al., 3 Oct 2025):
- Level Identification: Assign region of interest (ROI) to a precise vertebral level (label set: C1–C7, T1–T12, L1–L5, S1).
- Pathology Assessment: Detect, localize, and grade the presence of pathologies and assign severity using established radiological scales (e.g., Pfirrmann grades I–V for disc degeneration).
- Surgical Planning: Recommend approach (anterior/posterior), fusion levels, and instrumentation, often via simulated multi-turn consults replicating real-world surgical discussion.
SpineBench VQA
Tasks are:
- Spinal Disease Diagnosis: Multiple-choice selection between the true disease and visually similar hard distractors; “Healthy” is always included as an answer option.
- Spinal Lesion Localization: Multi-label selection of the affected segments from {L1/L2, L2/L3, L3/L4, L4/L5, L5/S1}; scored strictly on exact match.
SPIDER SpineBench (Segmentation)
Tasks focus on:
- 3D Segmentation: Precise delineation of vertebrae, IVDs, and spinal canal structures in lumbar MRI.
- Per-Level Radiological Grading: Labels include Modic changes, Schmorl’s nodes, spondylolisthesis, herniation, narrowing, bulging, and Pfirrmann grading, provided in accompanying CSVs.
4. Evaluation Metrics and Quantitative Results
Metrics Across Benchmarks
SpineMed-450k/SpineBench:
- Per-level metrics: $\mathrm{Precision}_\ell$, $\mathrm{Recall}_\ell$, and $F_{1,\ell}$ for each vertebral level $\ell$,
- Macro- and micro-averaged scores,
- Composite clinical utility score combining QA and report subdimensions normalized to a 0–100 scale.
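Per-level scoring with macro and micro averaging can be illustrated as follows. The sketch assumes a single-label setting (one vertebral level per item) and takes precision, recall, and F1 as the per-level quantities; the function name and return structure are hypothetical, and the composite clinical utility score is not modelled here.

```python
from collections import Counter


def per_level_scores(y_true, y_pred):
    """Per-level precision/recall/F1 plus macro- and micro-averaged F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two equal-length, non-empty label lists")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted level gets a false positive
            fn[truth] += 1  # true level gets a false negative
    scores = {}
    for level in sorted(set(y_true) | set(y_pred)):
        p_den, r_den = tp[level] + fp[level], tp[level] + fn[level]
        prec = tp[level] / p_den if p_den else 0.0
        rec = tp[level] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[level] = {"precision": prec, "recall": rec, "f1": f1}
    # Macro: unweighted mean over levels, so rare levels count equally.
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
    # Micro: in single-label classification this equals plain accuracy.
    micro_f1 = sum(tp.values()) / len(y_true)
    return scores, macro_f1, micro_f1
```

Macro averaging is the more demanding of the two when some vertebral levels are rare, since a model cannot hide poor performance on them behind frequent levels.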
SpineBench VQA:
- Diagnosis: Simple accuracy, $\mathrm{Acc} = N_{\text{correct}} / N_{\text{total}}$.
- Localization: Exact-match accuracy, as well as precision, recall, and per-instance $F_1$.
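The difference between exact-match scoring and the softer per-instance metrics is easiest to see on a single example. The sketch below treats the predicted and gold segments as sets; the function name and the dictionary keys are illustrative choices, not part of the benchmark's tooling.

```python
def localization_scores(pred_segments, gold_segments):
    """Score one localization answer against the gold set of segments."""
    pred, gold = set(pred_segments), set(gold_segments)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "exact": pred == gold,  # strict: every segment right, none extra
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# A prediction that finds the lesion but adds one spurious segment.
result = localization_scores({"L4/L5", "L5/S1"}, {"L4/L5"})
```

Here `result["exact"]` is false even though recall is perfect, which illustrates why exact-match accuracy can be much lower than the per-instance scores on multi-segment cases.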
SPIDER SpineBench:
- Dice Similarity Coefficient (3D),
- Average Absolute Surface Distance (ASD),
- Detection rate and completeness accuracy.
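For readers unfamiliar with the segmentation metrics, a minimal pure-Python sketch follows. It represents a mask as a collection of voxel coordinates and a surface as a list of points, and it uses one common symmetric definition of average surface distance computed by brute force; the exact definitions and implementation used by the SPIDER challenge may differ, and both function names are hypothetical.

```python
import math


def dice(mask_a, mask_b):
    """Dice similarity between two masks given as voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def mean_surface_distance(surf_a, surf_b):
    """Symmetric mean nearest-neighbour distance between two surfaces.

    Brute force, O(len(surf_a) * len(surf_b)); coordinates are assumed
    to be in physical units (e.g. mm) already.
    """
    if not surf_a or not surf_b:
        raise ValueError("both surfaces must contain at least one point")

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(surf_a, surf_b) + one_way(surf_b, surf_a))
```

Dice rewards volumetric overlap, while surface distance is more sensitive to boundary errors, so the two are usually reported together.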
Summary Table: Performance Highlights
| Benchmark | Task | Baseline Best Model* | Key Metric / Result |
|---|---|---|---|
| SpineMed-450k/SpineBench | Text QA Accuracy | SpineGPT | 89.5% (Δ+14.0pp over Qwen2.5-VL-7B) |
| | Image QA Accuracy | SpineGPT | 84.5% (Δ+10.4pp over Qwen2.5-VL-7B) |
| | Report Gen. Cumulative | SpineGPT | 87.2/100 |
| SpineBench VQA | Diagnosis (acc.), Loc. (acc) | Gemini-2.5-pro | 32.4% (diag.), 9.3% (loc. exact-match) |
| SPIDER SpineBench | Segmentation Dice | nnU-Net | 0.92 ± 0.05 (vertebrae) |
*See respective papers for further numerical details.
On SpineMed-450k/SpineBench, domain-specific training yields substantial and consistent gains in clinically grounded metrics. SpineBench VQA exposes severe performance bottlenecks across both generalist and medical-specialized MLLMs: diagnosis accuracy stays close to the 20% random baseline, and localization accuracies are as low as 4–13%. On SPIDER SpineBench, both the U-Net and nnU-Net baselines achieve Dice scores above 0.9 for vertebrae segmentation.
5. Clinical Validation and Practical Impact
SpineMed-450k/SpineBench: Outputs were subjected to blinded review by 17 board-certified surgeons, who rated clarity, completeness, and clinical utility. High Pearson correlations (most >0.7) between LLM-automatic and expert scores validate the automated rubric. Notably, outputs from fine-tuned models exhibited precise level annotations and actionable planning steps, features absent from baseline LVLMs (Zhao et al., 3 Oct 2025).
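The agreement check between automatic and expert ratings reduces to a Pearson correlation over paired scores. The sketch below is a plain implementation for illustration; the function name is hypothetical and no actual rating data from the study is reproduced.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sd_x * sd_y)
```

In a validation of this kind, `x` would hold the LLM-assigned scores and `y` the mean expert scores for the same outputs, computed separately for each rubric dimension.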
SpineBench VQA: Model failure modes include mislabeling normal anatomy as pathological, attributing disease without sufficient visual evidence, and low success in multi-segment lesion localization. Injecting textual disease definitions had a negligible effect, which suggests that the bottleneck lies in visual perception rather than in language knowledge.
SPIDER SpineBench: The continuous challenge platform with a hidden, sequestered test set facilitates reproducible head-to-head comparisons, while detailed per-structure and per-level gradings enable auxiliary use in weak supervision and multi-task settings.
6. Implications, Limitations, and Future Directions
The introduction of SpineBench benchmarks reveals pervasive limitations in current multimodal LLM and vision-language architectures when applied to the spine domain:
- Off-the-shelf MLLMs/LLMs lack the fine-grained visual reasoning required for clinically actionable diagnosis and planning.
- Robust level-aware and multimodal training, guided by clinician expertise and traceable data pipelines, is necessary for meaningful step-changes in performance.
- Segmentation models achieve high anatomical accuracy, but full clinical decision support requires integrating segmentation, recognition, and reasoning, a combination so far demonstrated only with domain-tuned models (Zhao et al., 3 Oct 2025, Zhang et al., 14 Oct 2025, Graaf et al., 2023).
Future research is recommended to incorporate explicit vertebral structure annotations, deploy advanced reasoning techniques, and combine general medical knowledge with targeted spine-pathology tasks. Ongoing publicly available challenges (e.g., SPIDER Grand Challenge) provide infrastructure for continuous benchmarking and methodological innovation.
7. Related Work and Distinctions
SpineBench benchmarks are substantively distinct from general-purpose VQA, radiology benchmarks, and generic segmentation tasks due to their emphasis on:
- Level-aware, multimodal, and clinician-aligned annotation and evaluation,
- Explicit incorporation of report generation, surgical planning, and multi-turn clinical consults as tasks,
- Enforced traceability, quality control (Cohen’s κ audits), and clinical validation with board-certified specialists.
A plausible implication is that benchmarks adhering to such design principles are required for translating AI systems from radiologic pattern-recognition to clinically integrated workflow solutions.