BrainBench: Neuro-AI Benchmarking for Imaging & Prediction
- BrainBench is a dual benchmark framework that evaluates neuro-AI models on clinical imaging VQA and neuroscience predictive tasks.
- The imaging benchmark assesses MLLMs across 15 modalities and 15 clinical tasks using metrics like accuracy, F1, and Cohen’s κ while comparing against expert radiologists.
- The predictive benchmark measures the capacity of LLMs and human experts to forecast neuroscience outcomes through forward-looking experiment abstracts with rigorous calibration.
BrainBench refers to two distinct but influential benchmarks in neuro-AI research: (1) a comprehensive brain-imaging visual question-answering (VQA) suite designed to evaluate multimodal LLMs (MLLMs) across clinical workflows (Peng et al., 2 Nov 2025), and (2) a forward-looking benchmark for quantifying the ability of LLMs and human experts to forecast outcomes of novel neuroscience experiments (Luo et al., 2024). Both are unified by their methodological rigor, task diversity, and focus on scientifically meaningful prediction, but target different axes of neuro-AI: clinical imaging and scientific literature.
1. Benchmark Scope and Definition
Brain Imaging Analysis Benchmark
BrainBench (as described in "OmniBrainBench") is a multimodal VQA benchmark for brain imaging that systematically assesses MLLMs on clinically relevant tasks spanning the full neuroradiology workflow. It comprises 15 brain imaging modalities, including coarse classes (CT, MRI, PET, SPECT, anatomical diagrams, histopathology) and fine-grained MRI subtypes (T1W, T2W, FLAIR, DWI, etc.). The benchmark contains 31,706 images and 9,527 clinically validated VQA pairs, with 15 clinical tasks mapped onto five sequential clinical stages: anatomical and imaging assessment (AIA), lesion identification and localization (LIL), diagnostic synthesis and causal reasoning (DSCR), prognostic judgment and risk forecasting (PJRF), and therapeutic cycle management (TCM).
Neuroscience Result Prediction Benchmark
BrainBench (as introduced in "LLMs surpass human experts in predicting neuroscience results") is a benchmark purpose-built to assess the predictive capacity of models and human experts regarding the outcomes of novel neuroscience experiments. Unlike factoid or backward-looking QA, this is a forward-looking benchmark: test items consist of published abstracts (post-training cutoff) that have their results sentences minimally but substantively modified to generate true-vs-fabricated pairs. The task is to choose which version reflects the real, published outcome. Data are drawn from all 2023 Journal of Neuroscience abstracts (five subfields), resulting in 200 expert-authored and 100 GPT-4-generated (+expert-vetted) test cases.
2. Dataset Construction and Annotation
Imaging Analysis Dataset Pipeline
Raw imaging was aggregated from 30 sources (e.g., BraTS, fastMRI, VQA-RAD, ADNI, RSNA ICH, Radiopaedia), yielding 597,853 images and 259,628 QA pairs ("BrainBench-Raw") prior to curation. Data formats included DICOM and NIfTI; 2D views (axial/sagittal/coronal) were determined under radiologist supervision. Annotation involved metadata-driven, rule-based question templating (modality- and disease-specific), GPT-5-powered distractor generation and rephrasing, and automatic filtering (non-brain exclusion, deduplication via Sentence-BERT/DINO-V2 clustering). Expert radiologists (13+ years’ experience) validated assignments and preserved complex multi-image pairs for reasoning-heavy tasks (e.g., preoperative assessment). The curated benchmark retains 31,706 images and 9,527 QA pairs with a class-balanced, clinical task-aligned distribution.
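The embedding-based deduplication step can be illustrated with a minimal greedy filter. This is a sketch under assumed inputs: it takes precomputed embedding vectors (the actual pipeline uses Sentence-BERT for text and DINO-V2 for images) and the `deduplicate` function and 0.95 threshold are illustrative, not the benchmark's exact procedure.

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate filter: keep an item only if its cosine
    similarity to every previously kept item stays below `threshold`."""
    # L2-normalize rows so dot products become cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Toy example: items 0 and 1 are near-duplicates, item 2 is distinct
emb = np.array([[1.0, 0.0], [0.999, 0.04], [0.0, 1.0]])
print(deduplicate(emb))  # → [0, 2]
```

In practice a clustering pass scales better than the pairwise greedy scan, but the acceptance criterion (embedding similarity below a threshold) is the same idea.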
Neuroscience Prediction Data Protocol
Each BrainBench test case consists of two versions of an abstract: the true, published version, and an altered version with a fabricated (but plausible) results sentence—constructed by flipping effect directions or manipulating conclusion-relevant elements while preserving linguistic and methodological coherence. The corpus exclusively comprises 2023 articles, ensuring all items postdate model training. For the BrainGPT model, 2002–2022 PubMed and PMC open-access journal corpora (∼456k documents, ∼1.3B tokens) were used for LoRA-based parameter-efficient adaptation; all test abstracts are held out.
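The resulting two-alternative task is scored by comparing how surprised the model is by each version. A minimal sketch, assuming per-token log-probabilities are already available from some scoring model (the helper names `perplexity` and `choose_version` are illustrative):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def choose_version(logprobs_a: list[float], logprobs_b: list[float]) -> int:
    """Return 0 if version A gets lower perplexity (judged more
    plausible by the model), else 1."""
    return 0 if perplexity(logprobs_a) < perplexity(logprobs_b) else 1

# Toy log-probs: version A is more probable under the model
print(choose_version([-1.0, -1.2, -0.8], [-2.0, -2.5, -1.9]))  # → 0
```

The prediction is correct when the chosen version is the real, published abstract; the perplexity gap between the two versions also supplies the model's confidence signal (Section 3).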
3. Evaluation Metrics and Protocols
Imaging VQA Metrics
Models are scored using overall accuracy, macro-averaged F₁, and Cohen’s κ:

κ = (p_o − p_e) / (1 − p_e),

where p_o is the observed agreement and p_e the expected agreement by chance. Models are evaluated in a strict zero-shot manner over all tasks, with comparison to three board-certified neuroradiologists.
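These agreement metrics can be computed from scratch in a few lines. A minimal sketch (the toy labels are illustrative, not benchmark data):

```python
from collections import Counter

def cohens_kappa(y_true: list[int], y_pred: list[int]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n  # observed agreement
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    # Chance agreement from the marginal label frequencies
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / n**2
    return (p_o - p_e) / (1 - p_e)

def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

y_true, y_pred = [0, 0, 1, 1], [0, 1, 1, 1]
print(round(cohens_kappa(y_true, y_pred), 3))  # → 0.5
print(round(macro_f1(y_true, y_pred), 3))      # → 0.733
```

In production one would typically use scikit-learn's `cohen_kappa_score` and `f1_score(average="macro")`, which implement the same definitions.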
Neuroscience Prediction Metrics
The primary metric is accuracy (the fraction of test cases for which the real abstract is chosen). Calibration is assessed via Expected Calibration Error, ECE = Σ_m (|B_m|/n) · |acc(B_m) − conf(B_m)|, the bin-weighted mean absolute gap between confidence and accuracy. Model confidence is derived from the relative difference in perplexity between the two abstract versions; for humans, confidence is self-rated post hoc. Only valid expert responses (response time ≥ 5 s, correct catch-trials, no study recognition) are counted.
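The ECE computation over equal-width confidence bins can be sketched as follows (the toy confidences are illustrative, not benchmark data):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE = sum over bins of (|B_m|/n) * |acc(B_m) - conf(B_m)|,
    using equal-width bins over [0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its size
            ece += mask.sum() / n * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

conf = [0.95, 0.9, 0.6, 0.55]
hit = [1, 1, 1, 0]
print(round(expected_calibration_error(conf, hit), 4))  # → 0.075
```

A well-calibrated predictor (confidence matching empirical accuracy in every bin) yields an ECE near zero; the bin-wise accuracy-vs-confidence curve from the same binning gives the calibration slopes discussed in Section 4.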
4. Baseline Results and Model Analysis
Imaging VQA Model Performance
The hierarchy in overall accuracy is: clinicians (91.35%), proprietary MLLMs (best: Gemini-2.5-Pro, 66.58%), medical-specialized MLLMs (HuatuoGPT-V-34B, 63.56%), and open-source MLLMs (Qwen3-VL-30B, 56.40%). Model performance stratifies by task: open-source MLLMs approach 76% on AIA tasks, proprietary models reach 84% in LIL, while all MLLMs decline sharply in preoperative assessment (PA) and surgical risk (RS)—lagging clinicians by >20 percentage points in these reasoning-heavy phases. No formal p-values are reported; observed effect sizes are large.
Neuroscience LLM Expert Comparison
General-purpose LLMs (GPT-3.5, GPT-4, Llama2, Falcon, Mistral) achieve mean zero-shot accuracy ≈ 81.4% (SD ≈ 3%), while human expert accuracy is 63.4% overall and 66.2% among the self-rated top 20%. LoRA-based neuroscience adaptation (BrainGPT, Llama2-7B-chat) improves accuracy by ≈ 3% absolute (~80% to ~83%), and a paired t-test confirms this improvement. Both LLMs and humans exhibit well-calibrated confidence, with positive slopes in bin-wise calibration curves. Itemwise error correlation between humans and LLMs is low (Spearman ρ ≈ 0.15), indicating that their errors are largely complementary.
5. Critical Challenges and Observed Gaps
Imaging Benchmark Challenges
A pronounced visual-to-clinical reasoning gap exists: MLLMs demonstrate high accuracy on tasks requiring direct perception (e.g., ASI, IMI, AS), but struggle markedly with complex, multi-step clinical reasoning such as risk stratification and surgical planning (e.g., PA, RS, TPS). Common error modes include mislabeling fine-grained anatomical features, over-reliance on textual priors embedded in distractor options, and insufficient 3D/contextual reasoning in multi-image tasks. Notably, MLLMs lack sufficient neuroanatomical pretraining, are limited in long-context planning, and occasionally generate hallucinated causal explanations.
Predictive Benchmark Challenges
The forward-looking design strictly guards against pretraining leakage. The primary model errors stem from semantic subtlety in results sentences and the domain-nuanced reasoning required to discern likely from fabricated findings. Notably, the LLMs’ pattern of mistakes overlaps little with that of human experts, suggesting that hybrid human–LLM teams could be complementary.
6. Implications and Prospects
BrainBench (imaging) is the first benchmark to comprehensively span 15 imaging modalities and 15 clinical tasks, each mapped to a phase of real-world clinical workflow. It pushes the field toward MLLMs with improved multimodal fusion, 3D context integration, longitudinal follow-up capability, and interpretable, risk-calibrated outputs. Suggested extensions include 3D volume QA, longitudinal imaging for treatment tracking, Brier scores for risk assessment, and human–MLLM collaborative studies.
The forward-looking neuroscience BrainBench demonstrates that large-scale LLMs can integrate cross-study findings more accurately than domain experts when forecasting novel results, and that domain tuning via LoRA yields further gains. The methodology is domain-general and can be ported to rapidly evolving research areas outside neuroscience, supported by an automated test-case pipeline (LLM-generated and expert-vetted) and continuous few-shot model refreshing. High-quality confidence calibration enables human–machine teaming strategies in which deferral and collaboration can be optimal.
Outstanding questions include how to operationalize structured risk guideline compliance in MLLMs (e.g., RANO protocols), architect models for multi-image and multiphase clinical reasoning (e.g., memory-augmented transformers), and develop principled frameworks for clinical-grade interpretability and regulatory alignment.
7. Significance and Future Directions
BrainBench establishes new standards for objective evaluation of AI systems in neuroradiology and neuroscience prediction. The imaging VQA suite exposes critical deficiencies in current MLLMs, especially in complex visual-to-reasoning integration, and serves as a reference for advancing clinical translation of multimodal models. The result-forecasting benchmark quantifies and contextualizes the magnitude by which LLMs (and adapted variants) can outperform experts in synthesizing and predicting scientific outcomes, suggesting a future landscape in which synergistic human–AI approaches accelerate both bench-to-bedside translation and fundamental neuroscience discovery (Peng et al., 2 Nov 2025, Luo et al., 2024).