Neurosurgical Anatomy Benchmark (NeuroABench)
- NeuroABench is a multimodal evaluation framework that curates annotated neurosurgical operative videos to benchmark ML models on fine-grained anatomical identification.
- It features an integrated annotation pipeline combining automated summarization, video analysis, and expert-controlled QA to ensure high dataset fidelity.
- Evaluation metrics include instance-level and anatomy-level accuracy, precision, recall, and F1 scores, highlighting current MLLM performance gaps versus clinical trainees.
The Neurosurgical Anatomy Benchmark (NeuroABench) is a multimodal evaluation framework designed to rigorously assess the performance of machine learning models, particularly Multimodal LLMs (MLLMs), in identifying fine-grained anatomical structures within neurosurgical operative videos. Developed in response to the lack of standardized datasets focusing on clinical anatomical understanding, NeuroABench provides a curated set of annotated surgical videos, structured QA tasks, and robust evaluation metrics. It establishes the first specialized platform dedicated to advancing automated anatomical comprehension, addressing an essential need in surgical education, intraoperative assistance, and medical AI research (Song et al., 7 Dec 2025).
1. Dataset Composition and Scope
NeuroABench is constructed from 89 clinician-curated neurosurgical approach recordings, representing a total of 9 hours of high-quality operative footage. This selection was drawn from an initial set of 886 videos hosted on the Neurosurgical Atlas, ensuring both procedural and anatomical diversity. The dataset covers 32 distinct neurosurgical approaches—examples include pterional craniotomy, retrosigmoid approach, and interhemispheric fissure dissection—encompassing 89 different operative procedures. Alongside the video corpus, 32 clinician-reviewed teaching manuals detail procedural stages and key anatomical landmarks.
The anatomical taxonomy consists of 68 classes spanning cortical, subcortical, vascular, neural, meningeal, and osseous landmarks. Categories include:
- Cerebral hemispheres and lobes: frontal, parietal, temporal, occipital, insula
- Gyri and sulci: precentral, postcentral, Sylvian fissure, central sulcus, cingulate
- Deep nuclei/white matter: caudate, putamen, globus pallidus, internal capsule
- Limbic: hippocampus, amygdala
- Ventricular system: lateral, third, fourth ventricles
- Brainstem/cerebellum: midbrain, pons, medulla, cerebellar peduncles, tonsils
- Dural partitions/meninges: dura mater, falx cerebri, tentorium cerebelli, arachnoid mater, pia mater
- Major arteries/veins: anterior/middle/posterior cerebral arteries, internal carotid, basilar, sagittal/transverse/sigmoid sinuses, vein of Galen
- Cranial nerves/foramina: optic nerve/chiasm, oculomotor, trigeminal, facial, vestibulocochlear, hypoglossal, internal auditory canal, foramen magnum, jugular foramen
- Skull base/air cells: clivus, sella turcica, sphenoid sinus, mastoid air cells
Data modalities include video frames extracted at 1 Hz with embedded audio (surgeon narration), structured per-frame JSON annotations (surgical phase, key anatomy, temporal location), and metadata from manuals and automatic workflow summaries.
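The 1 Hz frame sampling itself is a standard preprocessing step. As an illustration only (not the authors' tooling), a minimal Python sketch using OpenCV might look like the following; the file paths and output naming are assumptions for the example:

```python
import cv2
from pathlib import Path

def extract_frames_1hz(video_path: str, out_dir: str) -> int:
    """Sample roughly one frame per second from a video and save them as JPEGs.

    Illustrative sketch only; NeuroABench's actual extraction tooling and
    file layout are not specified in the paper.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = int(round(fps))                   # number of frames between 1 Hz samples

    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            second = index // step
            cv2.imwrite(str(out / f"frame_{second:05d}.jpg"), frame)
            saved += 1
        index += 1

    cap.release()
    return saved
```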
2. Multimodal Annotation Pipeline
The annotation workflow comprises three sequential stages integrating LLM-based automation with rigorous expert review:
- Automatic Workflow Summarization: Raw teaching manuals are processed with OpenAI-o1 to generate draft procedural summaries, aligning each operative step with relevant anatomical structures. A senior neurosurgeon validates and corrects these drafts for clinical accuracy.
- Automated Video Annotation: Gemini-1.5-Pro ingests visual frames and synchronized audio, mapping video segments to annotated procedural steps. Outputs are structured as JSON tuples containing the surgical phase index (step_number), anatomical landmark (key_anatomy), and temporal interval (time_period). The pipeline standardizes terminology (e.g., expanding "IAC" to "Internal Auditory Canal" and merging synonyms such as "Dura" and "Dura Mater").
- Expert-Driven QA Generation and Quality Control: For each annotated operative interval, frames are sampled at one per second. A neurosurgical panel performs two full review cycles to exclude image–anatomy mismatches, low-quality images, and ambiguous content. For each selected frame, five predefined anatomical identification prompts are generated, each as a multiple-choice question with five options: four plausible structures plus a "None of the above" distractor. The position of the correct answer is balanced so that each of the five choices is correct in roughly 20% of items (one possible construction is sketched after this list). The final benchmark comprises 1,079 image–question–answer instances.
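The paper does not publish the exact prompt templates or slot-balancing procedure. The sketch below shows one plausible way to assemble a single five-option item under the stated constraints, assuming that "None of the above" is a fixed fifth option and is itself the correct answer in roughly 20% of items (so the true structure lands uniformly in the remaining four slots). The question wording, the TAXONOMY sample, and the helper name `build_mcq` are illustrative assumptions, not the authors' code:

```python
import random

# A few classes from the 68-class taxonomy, for illustration only.
TAXONOMY = [
    "Internal Auditory Canal", "Sigmoid Sinus", "Facial Nerve",
    "Tentorium Cerebelli", "Basilar Artery", "Sylvian Fissure",
    # ... remaining anatomical classes
]
NONE_OPTION = "None of the above"

def build_mcq(key_anatomy: str, rng: random.Random) -> dict:
    """Build one five-option QA item for a frame labeled with `key_anatomy`.

    One plausible reading of the balanced slotting: in ~20% of items the four
    listed structures exclude the true one, so the fixed fifth option
    "None of the above" is correct; otherwise the true structure is placed
    uniformly among the first four slots, giving ~20% per slot overall.
    """
    distractor_pool = [c for c in TAXONOMY if c != key_anatomy]

    if rng.random() < 0.2:
        # "None of the above" is correct: list four other structures.
        structures = rng.sample(distractor_pool, 4)
        answer_index = 4
    else:
        structures = rng.sample(distractor_pool, 3)
        answer_index = rng.randrange(4)            # uniform over slots A-D
        structures.insert(answer_index, key_anatomy)

    options = structures + [NONE_OPTION]
    return {
        "question": "Which anatomical structure is highlighted in this frame?",
        "options": options,
        "answer": "ABCDE"[answer_index],
        "key_anatomy": key_anatomy,
    }
```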
3. Benchmark Task Design and Evaluation Metrics
NeuroABench casts anatomical identification as a zero-shot, frame-level, multiple-choice task. Given a visual frame (optionally with audio/text prompt), models are required to select the correct anatomical structure from five options. Two distinct evaluation granularities are employed:
- Instance-Level: metrics (accuracy, macro precision, macro recall, macro F1) averaged over all QA pairs.
- Anatomy-Level: metrics calculated per anatomical class and macro-averaged across the 68 classes.
Formally, let $N$ be the number of test instances, $C$ the number of answer choices per question, and $\mathbb{1}[\hat{y}_i = y_i]$ an indicator of correctness for instance $i$:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i],$$

so that random guessing yields an expected accuracy of $1/C = 20\%$ for the five-choice format. Further metrics include macro-averaged precision, recall, and F1:

$$\text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FP_k}, \qquad \text{Recall}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}, \qquad \text{F1}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{2\,P_k R_k}{P_k + R_k},$$

where $K$ is the number of classes over which the macro average is taken ($K = 68$ at the anatomy level) and $P_k$, $R_k$ denote per-class precision and recall.
This design ensures balanced evaluation across both data instances and anatomical classes.
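For reference, a minimal scoring sketch consistent with these definitions follows. The `predict` callable that queries a model (e.g., an MLLM given the frame and the five options) is a hypothetical stand-in, and the macro average here is taken over classes observed in the run rather than the full 68-class taxonomy:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

LETTERS = "ABCDE"

def evaluate(items: Iterable[dict], predict: Callable[[dict], str]) -> Dict[str, float]:
    """Score MCQ items at the instance level and (approximately) the anatomy level.

    `items` are dicts like those produced by `build_mcq` above; `predict` is a
    hypothetical callable returning a letter in "ABCDE" for one item.
    """
    per_class = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    correct = total = 0

    for item in items:
        gold = item["answer"]
        pred = predict(item)
        total += 1
        correct += int(pred == gold)

        gold_label = item["options"][LETTERS.index(gold)]
        if pred == gold:
            per_class[gold_label]["tp"] += 1
        else:
            per_class[gold_label]["fn"] += 1
            if pred in set(LETTERS):
                pred_label = item["options"][LETTERS.index(pred)]
                per_class[pred_label]["fp"] += 1

    # Per-class precision/recall/F1, macro-averaged over the classes seen in
    # this run (the paper averages over all 68 anatomical classes).
    precisions, recalls, f1s = [], [], []
    for c in per_class.values():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    k = max(len(per_class), 1)
    return {
        "instance_accuracy": correct / max(total, 1),
        "macro_precision": sum(precisions) / k,
        "macro_recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,
    }
```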
4. Experimental Evaluation: Model and Human Performance
Over ten state-of-the-art MLLMs were benchmarked in a strict zero-shot setting. Leading results (instance-level accuracy) were:
| Model | Instance Acc. | Precision | Recall | F1 |
|---|---|---|---|---|
| Gemini-2.0-Flash | 40.87% | 46.61% | 41.07% | 38.56% |
| Claude-3.5-Sonnet | 40.41% | — | — | — |
| Qwen2.5-VL-72B | 37.44% | — | — | — |
| Random Baseline | 19.48% | — | — | — |
Detailed anatomy-level metrics for Gemini-2.0-Flash were: precision 29.68%, recall 27.02%, F1-score 25.52% (Song et al., 7 Dec 2025).
For human comparison, four neurosurgical trainees completed a subset of the benchmark:
| Trainee | Accuracy |
|---|---|
| #1 | 48% |
| #2 | 28% |
| #3 | 56% |
| #4 | 54% |
The average human accuracy was 46.5%, with individual scores ranging from 28% to 56%. Notably, the leading MLLM (40.87%) outperformed only the lowest-scoring trainee and still fell short of the human group mean, while human accuracy showed a wider spread (28–56%) than model accuracy (∼18–41%).
A plausible implication is that, while MLLM scores cluster in a narrower range, their anatomical recognition capabilities are not yet competitive with the average clinical trainee, a significant limitation for applications in safety-critical environments.
5. Key Findings, Limitations, and Methodological Insights
NeuroABench analysis highlights several substantive findings:
- Contemporary MLLMs exhibit significant deficiencies in recognizing fine-grained intraoperative neuroanatomy, with best-case performance (∼41%) markedly below clinical standards and below the average trainee level.
- Medical-specialized MLLMs (e.g., HuatuoGPT-Vision-34B) yield only modest gains over generalist models, suggesting current data and pre-training objectives undervalue intraoperative anatomical context.
- Major error sources include dynamic tissue deformation, occlusion from fluids or instruments, and a pre-training bias toward textbook (static, canonical) anatomy over the variable morphologies observed during surgery.
- Current models underexploit available multimodal cues, such as synchronized audio (operative narration), which may contain essential context for disambiguation.
- The rigorous, expert-reviewed annotation protocol ensures high dataset fidelity, supporting reliable model evaluation and comparative analysis.
6. Challenges and Future Research Directions
The NeuroABench benchmark identifies multiple technical and practical challenges in advancing anatomical AI for neurosurgery:
- Dynamic Morphology: Intraoperative deformation, tissue retraction, and bleeding frequently obscure landmarks, impeding model recognition.
- Static Pre-training Bias: Models primarily learn idealized, atlas-based representations rather than the transformed geometries seen in live operative settings.
- Multimodal Grounding Deficits: Integration of temporal and auditory features remains limited, restricting models' ability to exploit redundant contextual signals.
Future work converges on several themes:
- Surgical-domain pre-training employing broad, annotated operative video corpora to capture dynamic anatomic variation.
- Human-in-the-loop fine-tuning, enabling real-time correction by surgeons or trainees to enforce correct anatomical associations.
- Spatio-temporal modeling, leveraging sequential frames and phase-aware context for continuous landmark tracking.
- Incorporation of 3D anatomical priors (preoperative imaging or atlas registration) to enhance 2D intraoperative recognition under anatomical distortion.
- Explainable models capable of structured rationales for landmark identification, strengthening interpretability and error analysis.
NeuroABench thereby establishes a comprehensive, multimodal standard for quantitative benchmarking of anatomical AI in neurosurgery, quantifying model-human gaps, supporting error auditing, and driving progress toward reliable, safety-critical AI deployment in surgical practice (Song et al., 7 Dec 2025).