SpineMed: AI-Assisted Spinal Imaging Ecosystem

Updated 3 July 2026

SpineMed is a clinically grounded AI ecosystem that offers a vertebral-level–aware, multimodal dataset and comprehensive evaluation framework.
The SpineMed-450k dataset contains over 450k annotated instances across imaging modalities and clinical subcategories, ensuring traceability and high-quality curation.
The SpineBench framework provides benchmarks for pathology assessment, surgical planning, and automated report generation.

SpineMed is an ecosystem for clinically grounded, vertebral-level–aware AI in spinal imaging, comprising the large-scale, provenance-traceable multimodal SpineMed-450k dataset and the SpineBench evaluation framework. It directly addresses gaps in clinically reliable instruction data, standardized benchmarks, and evaluation tools for AI-assisted diagnosis and decision-making in spinal disorders. Co-developed with practicing spine surgeons, the SpineMed suite targets advanced reasoning across X-ray, CT, and MRI, focusing on vertebral-level granularity and clinical decision axes such as pathology assessment and surgical planning (Zhao et al., 3 Oct 2025).

1. Dataset Design and Clinician-in-the-Loop Curation

SpineMed-450k is the first dataset explicitly curated for vertebral-level, multimodal clinical reasoning. It comprises over 450,000 instruction instances, each annotated with traceable provenance, spanning:

Imaging modalities: X-ray, CT, MRI
Vertebral coverage: C1 through sacrum, with 14 clinical subcategories (cervical degenerative disease, idiopathic scoliosis, trauma, deformities, etc.)
Task variety: multiple-choice QA, open-ended QA, multi-turn simulated doctor–patient dialogues, and structured six-section report generation

The curation pipeline includes:

Collection & Preprocessing:
- Text and image extraction via OCR (PaddleOCR), with local context matching
De-identification & Cleaning:
- HIPAA-style removal of all patient identifiers, plus filtering of non-orthopedic content using an expert LLM
Annotation Generation (Two-Stage LLM):
- Draft phase: VLM-initialized QA/dialogue/report items
- Revision phase: explicit “revise-and-log” prompting, storing all edits for traceability
Final Clinician Review:
- Random spot-checks and policy audits ensure annotation quality
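The "revise-and-log" idea in the annotation stage can be illustrated with a small data structure. This is a minimal sketch, not the authors' implementation: the class names, field names, and the example text are all hypothetical, and the only property taken from the description above is that every edit is stored next to the item and its source identifier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Revision:
    """One logged edit from the revision phase (hypothetical schema)."""
    before: str
    after: str
    reason: str


@dataclass
class InstructionItem:
    """A drafted QA/dialogue/report item plus its full edit history."""
    source_id: str  # dataset ID, DOI, or case identifier (assumed field name)
    text: str       # current text of the item
    revisions: List[Revision] = field(default_factory=list)

    def revise(self, new_text: str, reason: str) -> None:
        # Record the edit before overwriting, so every change stays traceable.
        self.revisions.append(Revision(self.text, new_text, reason))
        self.text = new_text


# Hypothetical example item; the text is invented for illustration only.
item = InstructionItem(source_id="case-0001", text="Disc herniation at L5-S1.")
item.revise("Disc herniation at L4-L5.", reason="level corrected against image")
print(item.text, len(item.revisions))
```

Keeping the log on the item itself means a reviewer can replay the draft-to-final history without consulting a separate store.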

Quality is assessed via Cohen’s κ agreement (clinician vs. LLM revision, ≈0.82) and data consistency (ratio of unaltered key facts post-revision, averaging 0.91 across modalities). Every entry is linked to a dataset ID, DOI, or case identifier, enabling full traceability and source validation (Zhao et al., 3 Oct 2025).
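Both quality measures are simple to compute. The sketch below uses the standard definition of Cohen's κ for two raters. For data consistency it assumes that "key facts" can be represented as key-value pairs, which the source does not specify. The labels and facts in the example are toy values, not data from SpineMed-450k.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def consistency_ratio(facts_before, facts_after):
    """Fraction of key facts left unaltered by revision (assumed representation)."""
    kept = sum(1 for key, value in facts_before.items()
               if facts_after.get(key) == value)
    return kept / len(facts_before)


# Toy labels: clinician vs. LLM revision verdicts on four items.
clinician = ["accept", "accept", "reject", "accept"]
llm = ["accept", "reject", "reject", "accept"]
print(cohens_kappa(clinician, llm))  # 0.5

before = {"level": "L4-L5", "finding": "herniation", "side": "left", "severity": "moderate"}
after = {"level": "L4-L5", "finding": "herniation", "side": "left", "severity": "severe"}
print(consistency_ratio(before, after))  # 0.75
```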

2. The SpineBench Evaluation Framework

SpineBench is a benchmarking protocol that evaluates models along clinically relevant axes using multimodal imaging, with the following task structure:

Level Identification: Predict vertebral index $\hat{\ell}_i \in \{\mathrm{C1}, \ldots, \mathrm{S1}\}$ given multimodal input $x_i$.
Pathology Assessment: Classify lesion type and severity $y_i$ (herniation, stenosis, etc.).
Surgical Planning: Select or generate evidence-based intervention plans.

For classification, SpineBench uses:

Accuracy: $\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$
Precision: $\mathrm{P} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$
Recall: $\mathrm{R} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$
F1-score: $F_1 = 2\cdot \frac{P\,R}{P + R}$
Area under the ROC curve (AUC), binary and multi-class
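The four count-based metrics above follow directly from a binary confusion matrix. The sketch below is one straightforward way to compute them. The function name and the convention of returning 0.0 for undefined ratios are choices made here, not taken from SpineBench, and the counts are toy values.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    accuracy = (tp + tn) / total
    # Undefined ratios (no positive predictions or no positives) fall back to 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics["accuracy"], metrics["recall"])  # 0.85 0.8
```

For multi-class tasks such as lesion type, the same function can be applied per class in a one-vs-rest fashion and the results averaged; whether SpineBench uses macro or micro averaging is not stated in the text above.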

Report generation performance combines five expert-weighted diagnostic sections into a composite score:

$S = \sum_{i=1}^5 w_i\,\bar s_i, \quad \bar s_i = \frac{1}{n_i} \sum_{j=1}^{n_i} s_{ij}$

where $s_{ij} \in [1,5]$ are calibrated sub-scores per section (Zhao et al., 3 Oct 2025).
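The composite score is a weighted sum of per-section means. In the sketch below the uniform weights $w_i = 0.2$ and all sub-scores are hypothetical, since the expert-assigned weights are not given here. Only the formula itself is taken from the text.

```python
def composite_report_score(section_scores, weights):
    """S = sum_i w_i * mean_j(s_ij), with one list of sub-scores per section."""
    if len(section_scores) != len(weights):
        raise ValueError("one weight per section is required")
    total = 0.0
    for scores, weight in zip(section_scores, weights):
        if not scores:
            raise ValueError("every section needs at least one sub-score")
        section_mean = sum(scores) / len(scores)  # \bar s_i
        total += weight * section_mean
    return total


# Hypothetical sub-scores in [1, 5] for five sections, with assumed uniform weights.
sections = [[4, 5], [3, 3, 3], [5], [4, 4], [2, 4]]
weights = [0.2] * 5
score = composite_report_score(sections, weights)
print(round(score, 2))  # 3.9
```

With weights that sum to 1, $S$ stays on the same 1 to 5 scale as the sub-scores.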

3. Model Benchmarking and Experimental Findings

Contemporary large vision-language models (LVLMs) exhibit systematic failures in fine-grained, vertebral-level reasoning and surgical decision emulation when tested on SpineBench. Representative results include:

Model	Close-Ended QA (%)	Report Score (0–100)
Gemini-2.5-Pro	88.50	93.3
GLM-4.5V (open source)	83.98	79.2
Qwen2.5-VL-72B	82.75	63.80
SpineGPT (SpineMed)	87.89	87.24

Statistically significant gains were observed after fine-tuning on SpineMed-450k. Against GLM-4.5V, $\Delta\mathrm{Acc}_\mathrm{QA} = 3.91\%$ (87.89 vs. 83.98), alongside a higher report score (87.24 vs. 79.2); significance was assessed with a paired $t$-test and reported with a 95% confidence interval, highlighting the effect of level-aware, traceable training (Zhao et al., 3 Oct 2025).
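A paired comparison of two models scored on the same evaluation units can be sketched as follows. The per-split scores are toy numbers chosen for easy arithmetic, not results from the paper, and the choice of evaluation unit (split, case, or question) is an assumption. The function returns only the mean difference, the $t$ statistic, and the degrees of freedom; a $p$-value would additionally require the $t$ distribution.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Mean difference, t statistic and degrees of freedom for paired scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t_stat, n - 1


# Toy per-split accuracies (%) for a fine-tuned model and a baseline.
fine_tuned = [80, 90, 85, 95]
baseline = [70, 85, 80, 90]
mean_diff, t_stat, dof = paired_t_statistic(fine_tuned, baseline)
print(mean_diff, t_stat, dof)  # 6.25 5.0 3
```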

4. Level-Aware Clinical Reasoning and Utility

Traditional models frequently mislocalize spine pathology by one or more vertebral levels, adversely impacting diagnoses and downstream intervention planning. Training on level-labeled instances from SpineMed-450k reduced misidentification errors from approximately 21% to 9%. In clinical assessment, model-generated reports attained a mean clarity rating of 4.6/5 from 17 board-certified surgeons, and 94% of plans were deemed “clinically actionable.” The outputs supply explicit, guideline-consistent recommendations, e.g., for L4–L5 herniation and surgical risk stratification (Zhao et al., 3 Oct 2025).
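Off-by-one level errors can be quantified by mapping vertebral labels to an ordered index. The sketch below assumes the typical count of 7 cervical, 12 thoracic, and 5 lumbar vertebrae followed by S1, and it ignores transitional anatomy. The function names are illustrative, and the only figures taken from the text are the 21% and 9% error rates.

```python
# Ordered vertebral levels C1..C7, T1..T12, L1..L5, S1 (typical anatomy assumed).
LEVELS = ([f"C{i}" for i in range(1, 8)]
          + [f"T{i}" for i in range(1, 13)]
          + [f"L{i}" for i in range(1, 6)]
          + ["S1"])
INDEX = {name: position for position, name in enumerate(LEVELS)}


def level_offset(predicted, true):
    """Signed number of levels between prediction and truth (0 means correct)."""
    return INDEX[predicted] - INDEX[true]


def misidentification_rate(predictions, truths):
    """Fraction of cases where the predicted level differs from the true level."""
    wrong = sum(1 for p, t in zip(predictions, truths) if level_offset(p, t) != 0)
    return wrong / len(truths)


print(level_offset("L5", "L4"))   # 1
print(level_offset("T12", "L1"))  # -1

# Relative reduction implied by the reported drop from about 21% to 9%.
relative_reduction = (0.21 - 0.09) / 0.21
print(round(relative_reduction, 3))  # 0.571
```

The signed offset makes it possible to separate adjacent-level slips from gross mislocalization, which matter differently for surgical planning.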

5. Integration with Quantitative Measurement and Segmentation Pipelines

SpineMed can leverage advanced automated measurement frameworks such as the Cascade Amplifier Regression Network (CARN) for rapid extraction of clinically relevant indices (e.g., vertebral and disc heights, total MAE ≈1.27 mm), improving reproducibility and objectivity for osteoporosis and disc-degeneration assessment (Pang et al., 2018). For segmentation and morphometry, multimodal fusion models like ATM-Net achieve superior level-wise segmentation (Dice 81.72% vs. 78.84% baseline on MRSpineSeg) via anatomy-aware text-guided fusion and channel-wise contrastive learning, enabling fine-grained morphometric tracking at the substructure level (Lian et al., 4 Apr 2025).
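The two figures of merit quoted above, mean absolute error for measurements and the Dice coefficient for segmentation, have standard definitions that can be sketched in a few lines. The masks are represented here as flat lists of 0/1 values for simplicity, and the example values are toy data, not CARN or ATM-Net outputs.

```python
def mean_absolute_error(predicted, true):
    """Average absolute difference, e.g. for heights measured in millimetres."""
    if len(predicted) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)


def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if size == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2 * intersection / size


# Toy measurements (mm) and toy masks for illustration only.
print(mean_absolute_error([25.0, 9.5, 27.0], [24.0, 10.0, 28.5]))  # 1.0
print(round(dice_coefficient([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]), 4))  # 0.6667
```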

3D denoising diffusion frameworks further enable accurate MRI-to-CT image translation, unlocking established CT-based spinal segmentation pipelines for MRI data and providing high-fidelity, artifact-resistant segmentations of small structures (e.g., spinous process) – a critical need for level tracking and surgical navigation (Graf et al., 2023).

6. Clinical Applications and Decision-Support Integration

SpineMed’s level-specific data supports downstream clinical applications, including:

Automated vertebral level counting via force-ultrasound data fusion, achieving 100% detection accuracy for vertebral labeling in controlled tests (vs. 80–90% for single-sensor systems), suitable for radiation-free interventional guidance (Tirindelli et al., 2020).
Report generation workflows, simulated multi-turn consultations, and deployment as a REST/GPU microservice for real-time inference (<0.1 s per exam in representative tasks) (Zhao et al., 3 Oct 2025, He et al., 2021).
Hybrid rule-based expert systems for symptom- and imaging-derived diagnosis, incorporating certainty-based backward chaining for anomaly detection and risk stratification, with >92% sensitivity across major spinal disorders in clinical validations (Dashti et al., 2023).
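Certainty-based backward chaining, as mentioned in the last item, can be illustrated with a toy rule base. This sketch uses MYCIN-style certainty factors, which is one common scheme and not necessarily the one used by Dashti et al. The rules, premises, and certainty values are invented for illustration and carry no clinical meaning.

```python
# Hypothetical rule base: goal -> list of (premises, rule certainty).
RULES = {
    "lumbar_disc_herniation": [
        (["radicular_leg_pain", "disc_protrusion_on_mri"], 0.8),
        (["positive_straight_leg_raise"], 0.6),
    ],
}


def certainty(goal, facts, rules):
    """Backward-chain from a goal, returning a certainty in [0, 1]."""
    if goal in facts:
        return facts[goal]  # directly observed finding
    combined = 0.0
    for premises, rule_cf in rules.get(goal, []):
        # A conjunction is only as certain as its weakest premise.
        premise_cf = min(certainty(p, facts, rules) for p in premises)
        cf = premise_cf * rule_cf
        # Combine positive evidence from separate rules (MYCIN-style).
        combined = combined + cf * (1.0 - combined)
    return combined


# Toy findings with assumed certainties.
facts = {
    "radicular_leg_pain": 0.9,
    "disc_protrusion_on_mri": 1.0,
    "positive_straight_leg_raise": 0.5,
}
result = certainty("lumbar_disc_herniation", facts, RULES)
print(round(result, 3))  # 0.804
```

Here the first rule contributes 0.9 × 0.8 = 0.72 and the second 0.5 × 0.6 = 0.30, which combine to 0.72 + 0.30 × (1 − 0.72) = 0.804. A goal with no supporting facts or rules evaluates to 0.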

7. Outlook and Limitations

SpineMed establishes the first large-scale, multimodal, provenance-rich resource for clinically relevant, fine-grained evaluation of AI systems in spinal care. It addresses fundamental limitations in existing medical AI by enforcing traceability, level granularity, and data quality. Outstanding challenges include extending 3D and whole-axial coverage, ensuring generalization across scanner and protocol variance, and clinical validation in diverse cohorts (including pediatric, post-traumatic, or pathological variants). Integration with high-precision segmentation, automated morphometry, and robust clinical decision-support systems positions SpineMed as the foundational ecosystem for advancing AI in spine medicine (Zhao et al., 3 Oct 2025).