Brain Tumor VQA Benchmark
- Brain Tumor VQA Benchmark is a comprehensive framework that integrates diverse MRI modalities and multi-center datasets to evaluate vision-language models for neuro-oncology.
- It employs advanced annotation protocols and rejection-aware evaluation methods to counteract modality collapse, positional bias, and forced-choice shortcutting.
- Evaluation metrics reveal that while current models underperform compared to human experts, the benchmark drives improvements in volumetric and clinical reasoning.
Visual question answering (VQA) benchmarks for brain tumors evaluate models’ ability to integrate multi-sequence magnetic resonance imaging (MRI) information with clinically relevant reasoning in neuro-oncology. These resources are pivotal for the development and assessment of vision-language models (VLMs) and multimodal large language models (MLLMs) for diagnostic, prognostic, and therapeutic tasks in brain tumor imaging. Benchmarks now encompass a range of tasks, question types, and levels of annotation rigor, reflecting advances in LLM/MLLM architectures, large-scale multi-center dataset integration, and domain-specific evaluation methodologies.
1. Dataset Construction and Composition
Brain tumor VQA benchmarks differ principally in imaging granularity, task scope, and annotation pipeline.
- Cohort Sources and Modalities: Key resources include curated subsets from the Brain Tumor Segmentation (BraTS) challenge (glioma (GLI), meningioma (MEN), metastases (MET)), public neuroimaging datasets and benchmarks (UCSF-PDGM, MM-NeuroOnco, NeuroQA, OmniBrainBench), and multi-institutional aggregations encompassing 2D slices, 3D volumes, and multi-modal data (MRI, CT, PET, SPECT) (Safari et al., 14 Aug 2025, Peng et al., 2 Nov 2025, Ghosh et al., 16 May 2026, Abbasi et al., 19 May 2026, Guo et al., 26 Feb 2026).
- Sample Overview:
| Benchmark | Main Cohorts | Imaging Type | QA Pairs (N) | Tumor Focus (%) |
|---|---|---|---|---|
| GPT-5/BraTS | BraTS-GLI, MEN, MET | 3-plane MRI mosaics | (not stated) | 100 |
| OmniBrainBench | 30 sources, 15 modalities | 2D slices, multi-modal | 9,527 | ~22 |
| UCSF-PDGM-VQA | UCSF-PDGM (gliomas) | 3D multi-sequence MRI | 2,387 | 100 |
| NeuroQA | BraTS-GLI/–MEN | 3D MRI volumes | 5,746 | 100 |
| MM-NeuroOnco | 20 repositories, 8 tumors | 2D slices, 4 seqs, mask-level | ~200,000 | Focus |
Benchmarks span closed-ended multiple-choice (MCQ, Yes/No, numeric), open-ended and segmentation-augmented VQA, with structured radiology/clinical descriptors and mask-based ground truth typically used for annotation (Safari et al., 14 Aug 2025, Peng et al., 2 Nov 2025, Abbasi et al., 19 May 2026, Guo et al., 26 Feb 2026).
- Annotation Protocols:
  - Rule-based, LLM-assisted, and expert-verified QA pipelines (e.g., GPT-4o/GPT-5.2, dual-VLM consensus, FreeSurfer/BraTS alignment, multi-pass clinical review)
  - Inter-annotator agreement above 95% (OmniBrainBench)
  - Rejection-aware labeling to prevent forced-choice bias (MM-NeuroOnco)
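The source reports inter-annotator agreement only as a percentage and does not say which statistic was used. As a minimal sketch, assuming two annotators who each assign one categorical answer per QA item, raw percent agreement and the chance-corrected Cohen's kappa could be computed as follows. The function names are illustrative and do not come from any of the cited benchmarks.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each drew independently from their own label distribution.
    p_chance = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)
```

Percent agreement can look high when one answer dominates the dataset, which is why a chance-corrected statistic is often reported alongside it.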
2. VQA Taxonomy, Task Definition, and Evaluation Approaches
Benchmarks categorize questions according to clinical reasoning granularity:
- Question Types:
  - Visual/Perceptual: directly observable MRI features such as enhancement, edema, mass effect, lesion boundary (e.g., "Does the lesion show ring enhancement?") (Safari et al., 14 Aug 2025, Ghosh et al., 16 May 2026).
  - Clinical Reasoning: integrating visual cues with domain knowledge for differential diagnosis, grading, treatment planning (e.g., "What is the most likely tumor grade?") (Peng et al., 2 Nov 2025, Abbasi et al., 19 May 2026).
  - Measurement/Localization: precise quantification (e.g., maximum dimension, lobe/region, margin definition) (Safari et al., 14 Aug 2025, Abbasi et al., 19 May 2026).
  - Open-ended: free-form answers justifying findings or proposing management (Guo et al., 26 Feb 2026).
- Evaluation Metrics:
  - Accuracy: fraction of correctly answered QA pairs
  - Dice Coefficient: lesion segmentation overlap (as needed)
  - F1-score: multi-label grading/differential
  - Shortcut Score (Editor's term): normalizes image-grounded gain above the text-only floor (Abbasi et al., 19 May 2026).
  - Rejection Rate:
QA Balance and Reliability: Some benchmarks, such as NeuroQA and MM-NeuroOnco, implement answer-distribution refinement ensuring that text-only accuracy is forced to near chance, highlighting true visual reasoning (Abbasi et al., 19 May 2026, Guo et al., 26 Feb 2026).
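The evaluation metrics listed in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any benchmark's reference implementation: masks are flat sequences of 0/1 values, the F1 shown is a single-label macro average, and the form of the shortcut score, `(acc_image - acc_text) / (1 - acc_text)`, is an assumed normalization, since the source gives no formula.

```python
def accuracy(preds, golds):
    """Fraction of QA pairs answered correctly."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def dice(pred_mask, gold_mask):
    """Dice overlap between two flat binary masks (sequences of 0/1)."""
    intersection = sum(1 for p, g in zip(pred_mask, gold_mask) if p and g)
    total = sum(1 for p in pred_mask if p) + sum(1 for g in gold_mask if g)
    return 1.0 if total == 0 else 2 * intersection / total


def macro_f1(preds, golds):
    """Unweighted mean of per-label F1 (single-label setting, assumed)."""
    scores = []
    for label in set(preds) | set(golds):
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        # Every label occurs at least once, so the denominator is positive.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def shortcut_normalized_gain(acc_image, acc_text_only):
    """Assumed form: share of the headroom above the text-only floor
    that is recovered once the image is supplied."""
    if acc_text_only >= 1.0:
        raise ValueError("text-only accuracy leaves no headroom")
    return (acc_image - acc_text_only) / (1.0 - acc_text_only)
```

Under this assumed normalization, a value near zero means the image adds little beyond what the question text already gives away, and a negative value means the model does worse with the image than without it.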
3. Baseline and State-of-the-Art Model Performance
Models are grouped by architecture and domain specialization, with all recent evaluations conducted in strict zero-shot or rejection-aware settings.
Proprietary MLLMs (GPT-5 family, Gemini-2.5/3-Pro, Claude-Sonnet-4.0): macro-average accuracy for closed-ended VQA in the 40%–67% range; Gemini-3-Flash attains 40.9% on diagnosis QA (MM-NeuroOnco), and GPT-5-mini achieves 44.19% on BraTS-based VQA (GLI/MEN/MET) (Safari et al., 14 Aug 2025, Guo et al., 26 Feb 2026).
Medical MLLMs (HuatuoGPT-V, Lingshu, MedVLM, MedGemma): typically 47–64% overall accuracy, yet 30–40% on preoperative assessment and diagnosis; Lingshu-32B reaches 63.2% on UCSF-PDGM-VQA (Peng et al., 2 Nov 2025, Ghosh et al., 16 May 2026).
Open-source VLMs (Qwen3-VL, Janus-Pro, InternVL): 45–56% overall but lowest (<40%) on complex tumor grading (Peng et al., 2 Nov 2025).
Text-only Baseline: Often nearly equivalent performance (52.57% for Qwen3-8B), revealing language-prompt shortcut risk (Ghosh et al., 16 May 2026).
Across benchmarks, human experts (board-certified neuroradiologists) exhibit 88–91% accuracy, uniformly 30–40 percentage points above the best MLLMs (Safari et al., 14 Aug 2025, Peng et al., 2 Nov 2025, Ghosh et al., 16 May 2026).
| Model | Tumor VQA Accuracy (%) | Source |
|---|---|---|
| GPT-5-mini | 44.19 | (Safari et al., 14 Aug 2025) |
| Gemini-3-Flash | 40.9 | (Guo et al., 26 Feb 2026) |
| MedGemma-1.5 (multi) | 63.57 | (Ghosh et al., 16 May 2026) |
| Qwen3-8B (text-only) | 52.57 | (Ghosh et al., 16 May 2026) |
| Human Expert | 88–91 | (Safari et al., 14 Aug 2025, Peng et al., 2 Nov 2025, Ghosh et al., 16 May 2026) |
4. Failure Modes, Biases, and Technical Limitations
Common weaknesses are observed among current VLMs/MLLMs, irrespective of model scale:
Modality Collapse: VLMs often ignore visual input, defaulting to language priors. For example, Lingshu-32B and Med3DVLM both increase accuracy when presented with blank images; text-only baselines match or exceed vision-integrated models on closed questions (Ghosh et al., 16 May 2026).
Positional Bias: Order of answer options causes 10–20% performance swings, independent of semantic content (Ghosh et al., 16 May 2026).
Inadequate 3D Context: None of the evaluated models natively handle multi-series, full-volume MRI, leading to “single-slice” reasoning and fragmentation (Ghosh et al., 16 May 2026, Guo et al., 26 Feb 2026).
Over-reliance on Forced-Choice: Benchmarks lacking rejection or abstain options mask genuine uncertainty, encouraging overconfident guesses and hallucinations (Guo et al., 26 Feb 2026).
Clinical Implications: Such biases risk inaccurate lesion quantification, localization, and mass effect assessment, with direct impact on surgical/radiotherapy targeting and prognosis (Ghosh et al., 16 May 2026).
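The positional bias described above can be probed by re-asking each multiple-choice question with the options rotated, so that the correct answer visits every slot once. The sketch below is a hypothetical harness: `answer_fn` stands in for whatever model call is being evaluated and is assumed to return the index of the chosen option; nothing here is taken from the cited papers' code.

```python
def accuracy_by_gold_position(items, answer_fn):
    """Accuracy as a function of where the correct option is placed.

    items: list of (question, options, gold) where options is a list of
           unique strings and gold is the correct option's text.
    answer_fn: callable (question, options) -> index of the chosen option.
    """
    n_options = len(items[0][1])
    hits = [0] * n_options
    for question, options, gold in items:
        if len(options) != n_options:
            raise ValueError("all items must have the same number of options")
        # Cyclic rotations place the gold answer in each slot exactly once.
        for shift in range(n_options):
            rotated = options[shift:] + options[:shift]
            gold_position = rotated.index(gold)
            choice = answer_fn(question, rotated)
            hits[gold_position] += int(choice == gold_position)
    return [h / len(items) for h in hits]


def positional_swing(per_position_accuracy):
    """Gap between the best and worst slot; zero means order-invariant."""
    return max(per_position_accuracy) - min(per_position_accuracy)
```

A model that answers from content alone yields a flat per-position profile, while one that favors a particular slot shows a large swing even though the questions themselves never change.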
5. Innovations in Evaluation Design and Annotation
Recent brain tumor VQA benchmarks introduce methodologies to actively counter shortcutting, hallucination, and annotation drift:
Image-Grounding Protocols: NeuroQA mandates that image-absent accuracy falls to chance (~39.5%) and monitors for fabrications and hallucinations when no image is present (Abbasi et al., 19 May 2026).
Shortcut Suppression: Answer-distribution refinement and balanced MCQ/YN insertion ensure any gain above the text-only floor reflects true image reasoning (Abbasi et al., 19 May 2026).
Rejection-Aware Evaluation: MM-NeuroOnco’s “None of the above” reduces forced-choice bias, and ablation studies quantify the effect of rejection options on accuracy (Guo et al., 26 Feb 2026).
Deterministic Audit Pipelines: Multi-stage rule pipelines (e.g., in NeuroQA, OmniBrainBench) enforce ground-truth alignment for all answers, with clinical, machine, and script-based review stages (Peng et al., 2 Nov 2025, Abbasi et al., 19 May 2026).
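Rejection-aware evaluation, as described above, can be illustrated with a small sketch. It assumes one simple construction: a "None of the above" option is appended to every question, and for a subset of items the true answer is removed so that rejecting becomes the correct response. The source does not specify how MM-NeuroOnco builds or scores these items, so the helper names and the definition of rejection rate used here (fraction of predictions that choose the rejection option) are assumptions.

```python
NOTA = "None of the above"


def add_rejection_option(options, gold, drop_gold):
    """Append the rejection option; optionally remove the true answer
    so that rejecting becomes the correct response."""
    kept = [o for o in options if not (drop_gold and o == gold)]
    return kept + [NOTA], (NOTA if drop_gold else gold)


def rejection_aware_scores(preds, golds):
    """Accuracy, how often the model rejects, and how often it rejects
    when rejecting is actually correct."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(golds)
    preds_on_nota_items = [p for p, g in zip(preds, golds) if g == NOTA]
    return {
        "accuracy": sum(p == g for p, g in zip(preds, golds)) / n,
        "rejection_rate": sum(p == NOTA for p in preds) / n,
        # None when the evaluation set contains no rejection items.
        "nota_recall": (
            sum(p == NOTA for p in preds_on_nota_items) / len(preds_on_nota_items)
            if preds_on_nota_items
            else None
        ),
    }
```

Comparing accuracy on the same questions with and without the rejection option gives the kind of ablation the paragraph above mentions: a model that was guessing among plausible distractors tends to lose accuracy once a valid way to abstain exists.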
6. Strategies for Performance Improvement and Future Research Trajectories
Several improvement paths are outlined across the benchmark literature:
- Model Architecture:
  - Native 3D/multisequence inputs; cross-slice or volumetric attention; region-of-interest guidance (Ghosh et al., 16 May 2026).
  - Enhanced multimodal fusion layers (pixel-level cross-attention) and integration of segmentation pre-processing (Safari et al., 14 Aug 2025, Peng et al., 2 Nov 2025).
  - Chain-of-Thought regularization and prompt ensembling to stabilize reasoning (Guo et al., 26 Feb 2026, Safari et al., 14 Aug 2025).
- Training Curriculum:
  - Domain-specific fine-tuning on neuro-oncology MRI corpora, using contrastive vision–language objectives (Peng et al., 2 Nov 2025, Safari et al., 14 Aug 2025).
  - Multi-task pretraining: joint segmentation, detection, classification, and clinical VQA (Guo et al., 26 Feb 2026).
- Benchmark and Workflow Expansion:
  - Introduction of progressive clinical difficulty (single-image to longitudinal/temporal), open-ended and multi-turn dialogue VQA, human–AI interactive review, and explicit uncertainty quantification (ECE, Brier score) (Abbasi et al., 19 May 2026, Guo et al., 26 Feb 2026).
  - Incorporation of segmentation-plus-VQA and linkage of clinical text (structured radiology findings) for complex numerics (Peng et al., 2 Nov 2025, Abbasi et al., 19 May 2026).
- Annotation and QA Rigor:
  - Stratified sampling to reflect real-world case diversity, expert- and machine-audited QA, and reporting of inter-annotator agreement (Peng et al., 2 Nov 2025, Abbasi et al., 19 May 2026, Guo et al., 26 Feb 2026).
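The uncertainty measures named in the list above, expected calibration error (ECE) and the Brier score, have standard definitions that can be sketched briefly. The version below assumes each prediction comes with a single confidence in [0, 1] for the chosen answer and a flag for whether that answer was correct; how a given VLM exposes such a confidence is outside the scope of this sketch, and equal-width binning with ten bins is a common default rather than something the source prescribes.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if len(confidences) != len(correct) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(
        (c - float(k)) ** 2 for c, k in zip(confidences, correct)
    ) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, k in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((c, k))
    total = len(correct)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        bin_accuracy = sum(1 for _, k in members if k) / len(members)
        ece += len(members) / total * abs(bin_accuracy - mean_confidence)
    return ece
```

Both scores are lower-is-better. The Brier score rewards predictions that are both correct and confident, whereas ECE isolates whether stated confidence tracks observed accuracy, which is the property that matters when a model's output is used to decide whether a case needs human review.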
7. Clinical Significance and Benchmark Limitations
Despite increasing dataset scale, annotation rigor, and advances in VLM/MLLM architecture, model accuracy in tumor-related VQA remains well below clinical acceptability (roughly 40–67% for the models reported above versus 88–91% for human experts). Model errors, driven by insufficient visual grounding, shortcutting, and lack of volumetric context, impede reliable translation to high-stakes neuro-oncology workflows (Safari et al., 14 Aug 2025, Peng et al., 2 Nov 2025, Ghosh et al., 16 May 2026). Current benchmarks may overestimate reliability when relying solely on closed-ended, forced-choice tasks without rejection-aware or calibration measures (Guo et al., 26 Feb 2026).
This suggests that benchmarks must systematically audit for language priors, enforce image-grounded reasoning, and standardize uncertainty-aware metrics to support clinically safe and actionable deployment. A plausible implication is that future benchmarks will further expand the diagnostic horizon (e.g., perfusion/diffusion imaging, molecular imaging, longitudinal studies), integrate complex multi-stage clinical reasoning, and require ensemble or human-in-the-loop models for robust, real-world performance (Peng et al., 2 Nov 2025, Abbasi et al., 19 May 2026).