Papers
Topics
Authors
Recent
Search
2000 character limit reached

SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Published 3 Oct 2025 in cs.CV and cs.AI | (2510.03160v1)

Abstract: Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-LLMs (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

Summary

  • The paper presents SpineBench, a level-aware benchmark powered by the large-scale, multimodal SpineMed-450k corpus.
  • It introduces a rigorous, clinician-in-the-loop pipeline and a novel picture context matching algorithm for robust data alignment across imaging modalities.
  • Benchmarking reveals that domain-specific models like SpineGPT significantly outperform general LVLMs in fine-grained spinal clinical reasoning.

SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Motivation and Problem Statement

Spinal disorders represent a major global health burden, requiring nuanced, multimodal, and level-specific clinical reasoning for accurate diagnosis and management. Existing AI systems for medical imaging and clinical decision support are constrained by the lack of large-scale, traceable, and clinically validated datasets that reflect the complexity of real-world spine care. General-purpose LVLMs and even medical-specific LLMs have demonstrated limited performance in fine-grained, vertebral-level reasoning, particularly when integrating information across X-ray, CT, and MRI modalities. The absence of standardized, spine-specific benchmarks further impedes progress in this domain.

SpineMed-450k: Dataset Construction and Characteristics

The SpineMed-450k dataset is a large-scale, multimodal, and provenance-rich instruction corpus specifically designed for orthopedic and spinal diagnosis. Its construction followed a rigorous clinician-in-the-loop pipeline, ensuring clinical validity and traceability at every stage. Figure 1

Figure 1: Overview of SpineMed-450k, illustrating the multi-source, multi-stage curation process and the diversity of data types.

Data Sources and Curation

  • Sources: Textbooks, clinical guidelines, expert consensus documents, open-access case reports, public datasets (e.g., Spark, VerSe), and ~1,000 de-identified hospital cases.
  • Modalities: Text, X-ray, CT, MRI, and structured tables.
  • Data Types: Multiple-choice QA, open-ended QA, multi-turn dialogues, and structured clinical reports.
  • Annotation Pipeline: Data preprocessing (de-identification, deduplication, OCR), expert LLM-driven curation (two-stage draft and revision), and final clinician review. Figure 2

    Figure 2: Generation pipeline of SpineMed-450k, highlighting the integration of expert LLMs and clinician review for high-quality, multimodal data generation.

Structured Information Extraction

A custom pipeline leveraging PaddleOCR and a novel picture context matching algorithm was developed to maintain the integrity of multimodal data and ensure precise mapping between images, captions, and contextual text. Figure 3

Figure 3: The picture context matching algorithm for robust multimodal data alignment.

Dataset Statistics

SpineMed-450k comprises over 450,000 instruction instances, with balanced coverage across seven orthopedic subspecialties and 14 spine subconditions. The dataset is demographically and institutionally diverse, with real clinical cases sourced from 11 hospitals over three years. Figure 4

Figure 4: Statistics of SpineMed-450k, including hospital distribution, disease prevalence, modality/language breakdown, and token length distributions.

SpineBench: Benchmark Design and Evaluation Framework

SpineBench is a clinically grounded, level-aware benchmark derived from SpineMed-450k, designed to evaluate AI models on axes directly relevant to clinical practice.

Benchmark Construction

  • Sampling: 500 multiple-choice questions and 100 medical report prompts, covering 14 spinal sub-diseases and multiple data sources.
  • Validation: Triple-blind review by 17 board-certified orthopedic surgeons, ensuring high-quality, unbiased evaluation items.

Evaluation Metrics

A comprehensive, multi-dimensional scoring system was developed, integrating text-only and multimodal QA, as well as structured report generation. Report evaluation spans five key sections: imaging findings, AI-assisted diagnosis, treatment recommendations (patient and physician perspectives), risk/prognosis, and reasoning/disclaimer, each scored on a 1–5 scale with decimal granularity.

Experimental Results and Analysis

Model Training and Curriculum

The baseline model, SpineGPT, is a Qwen2.5-VL-7B-Instruct derivative, trained in three curriculum stages: (1) general/orthopedic foundational learning, (2) specialized spinal health, and (3) advanced report generation and dialogue. The ablation study demonstrates that domain-aligned orthopedic and spine-specific data are essential for high performance; general medical data alone is insufficient.

Benchmark Performance

Figure 5

Figure 5: Benchmark performance of SpineGPT, demonstrating consistent improvements over open-source and proprietary LVLMs.

  • Open-Source LVLMs: Qwen2.5-VL-72B and GLM-4.5V underperform on both QA and report generation, with pronounced deficits in multimodal tasks.
  • Proprietary LVLMs: Gemini-2.5-Pro and GPT-5 achieve higher scores, but SpineGPT narrows the gap, outperforming several proprietary models on text-only QA and achieving the highest open-source report generation scores.
  • Cross-Modal Alignment: All models exhibit a performance drop on image-based tasks, highlighting persistent challenges in vision-language integration for medical imaging.

Human-Expert Agreement

Figure 6

Figure 6: Consistency evaluation of large models and scores given by medical experts, showing strong alignment between LLM and expert assessments.

Pearson correlation coefficients between LLM and expert scores are consistently high (most >0.7), validating the reliability of the automated evaluation framework.

Qualitative Comparison: SpineGPT vs. Generalist LLMs

Figure 7

Figure 7: Comparative analysis of medical report generation for adolescent idiopathic scoliosis, showing SpineGPT's superior diagnostic depth and clinical reasoning compared to ChatGPT-4o.

Figure 8

Figure 8: SpineGPT's structured medical report output, demonstrating comprehensive coverage of imaging, diagnosis, treatment, risk, and rationale.

Figure 9

Figure 9: ChatGPT-4o's report output, illustrating the limitations of general-purpose LLMs in clinical specificity and depth.

SpineGPT produces 72% more detailed content, with quantified risk assessments, 3D surgical planning, and evidence-based protocols, while ChatGPT-4o provides only basic recommendations.

Implications and Future Directions

Practical Implications

  • Clinical Utility: SpineMed-450k and SpineBench enable the development and rigorous evaluation of AI systems capable of clinically relevant, level-aware reasoning in spine care.
  • Specialization vs. Generalization: The results underscore the necessity of domain-specific, provenance-rich instruction data for high-stakes clinical applications, as generalist models fail to meet the required granularity and accuracy.
  • Automated Evaluation: The strong alignment between LLM-based and expert scoring supports scalable, reproducible benchmarking in medical AI.

Theoretical Implications

  • Data-Centric AI: The study reinforces the paradigm that high-quality, domain-aligned data is a primary driver of performance in specialized clinical AI systems.
  • Multimodal Reasoning: Persistent cross-modal performance gaps highlight the need for further research in vision-language alignment, particularly for complex medical imaging tasks.

Future Work

  • Expansion of the dataset to additional institutions and pathologies.
  • Scaling model size beyond 7B parameters and exploring reinforcement learning for further performance gains.
  • Direct, comprehensive benchmarking against leading proprietary models (e.g., GPT-4, Gemini) to establish clearer performance boundaries.

Conclusion

SpineBench and SpineMed-450k represent a significant advance in the infrastructure for clinically relevant, level-aware AI in spine care. The results demonstrate that specialized, clinician-curated instruction data is essential for enabling fine-grained, multimodal reasoning in high-stakes medical domains. The benchmark exposes systematic weaknesses in current LVLMs and provides a robust foundation for future research and development of clinically deployable AI systems in orthopedics and beyond.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.