HeadQA: Biomedical Reasoning Benchmark
- HeadQA is a benchmark of authentic multiple-choice questions drawn from Spanish healthcare specialization exams, requiring both factual recall and multi-hop clinical inference.
- HeadQA v2 extends v1 with more questions, additional languages, and image-linked items, and has been benchmarked with a range of prompting and retrieval-based methods.
- Selective Chain-of-Thought and interpretable multi-step frameworks enhance efficiency and transparency, pointing to future directions in biomedical QA.
HeadQA is a large-scale, multilingual benchmark designed to evaluate complex reasoning abilities of artificial intelligence systems in the healthcare domain. Originating from official Spanish medical‐specialty board examinations, HeadQA consists of challenging, multiple‐choice questions that require a spectrum of cognitive skills from factual recall to multi-hop clinical inference. Advances in dataset construction (HeadQA v1 to v2), benchmarking protocols, and model development have established HeadQA as a critical testbed for assessing and advancing biomedical reasoning in natural language processing.
1. Dataset Origins, Structure, and Expansion
HeadQA was introduced to address the need for an authentic and demanding benchmark for complex reasoning in medical question answering. The initial release comprised 6,765 multiple-choice questions extracted from official Spanish Ministry of Health specialization exams, specifically targeting disciplines such as medicine (MIR), nursing (EIR), psychology (PIR), biology (BIR), chemistry (QIR), and pharmacy (FIR) (Vilares et al., 2019). These questions represent real-world, high-stakes evaluation and encompass a variety of reasoning types, including multi-step diagnosis, procedural calculations, and image-based interpretation (with ~14% of MIR questions linked to supporting figures).
HeadQA v2 significantly expands coverage, compiling 12,751 questions from a decade of Spanish exams spanning 2013–2022 and introducing machine-translated variants in English, Italian, Galician, and Russian (Correa-Guillén et al., 19 Nov 2025). The v2 release incorporates 334 image-linked questions and explicitly converts chemical formulas into SMILES, reflecting increased conceptual and linguistic diversity. An overview of dataset scale and language coverage by version is shown below:
| Version | #Questions | Languages |
|---|---|---|
| HeadQA v1 | ~6,800 | Spanish, English |
| HeadQA v2 | 12,751 | Spanish, English, Italian, Galician, Russian |
HeadQA v2 omits pre-defined training/dev/test splits, instead providing a unified benchmarking collection to facilitate cross-model and cross-lingual studies.
2. Question Formats and Reasoning Taxonomy
HeadQA questions are structured as multiple-choice items, with four answer options (2015–2022) or five (2013–2014). Each item includes a question stem and labeled answer choices; some include medical illustrations. The question pool spans diverse reasoning demands:
- Factual Recall: Direct queries on specific knowledge units (e.g., drug indications, chemical reactions).
- Conceptual Understanding: Interpretation, classification, or explanation of healthcare concepts.
- Multi-hop Inference: Stepwise reasoning across patient vignettes, symptoms, tests, and diagnoses.
- Procedural Reasoning: Numeric calculations or protocol-based choices (e.g., pharmacology dose computations).
- Image-based Reasoning: Clinical images requiring visual interpretation (primarily in MIR questions).
Typical token spans per question range from ~11 to ~55 for prompts and ~5 to ~9 for responses (Vilares et al., 2019). Reasoning-type distribution in HeadQA places significant emphasis on multi-hop and domain-specific inference, making it more challenging than general-purpose biomedical QA datasets.
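The item format described at the start of this section can be sketched as a small record type. This is a minimal illustration, not the dataset's actual schema: the class name `HeadQAItem`, its field names, and the `format_prompt` helper are all hypothetical, and the option-count rule encodes only the year ranges stated above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HeadQAItem:
    """Hypothetical record for one multiple-choice item (field names are assumptions)."""
    stem: str
    options: List[str]
    answer: int                  # 1-based index of the correct option
    year: int
    image: Optional[str] = None  # path or identifier of a linked figure, if any

    def __post_init__(self) -> None:
        if not 2013 <= self.year <= 2022:
            raise ValueError("this sketch only covers the 2013-2022 exams")
        # Five options in 2013-2014, four from 2015 onwards.
        expected = 5 if self.year <= 2014 else 4
        if len(self.options) != expected:
            raise ValueError(f"expected {expected} options for year {self.year}")
        if not 1 <= self.answer <= len(self.options):
            raise ValueError("answer index out of range")

    def format_prompt(self) -> str:
        """Render the stem followed by numbered answer options."""
        lines = [self.stem]
        lines += [f"{i}. {opt}" for i, opt in enumerate(self.options, start=1)]
        return "\n".join(lines)


# Illustrative item with placeholder text, not taken from the dataset.
item = HeadQAItem(stem="Q?", options=["a", "b", "c", "d"], answer=2, year=2016)
print(item.format_prompt())
```

Validating the option count at construction time catches year and format mismatches early, which matters when items from different exam years are pooled into one collection.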
3. Benchmarking Protocols and Methodological Advances
Benchmarks on HeadQA evaluate exact-match answer selection accuracy, often reporting additional metrics such as normalized exam scores and unanswered ratios (where applicable). Several methodological paradigms have been explored:
- Prompting: Zero-shot, few-shot, and chain-of-thought (CoT) prompting strategies. CoT is particularly tailored to HeadQA’s clinical cases, with prompts for explicit evidence listing, differential diagnosis, and distractor elimination (Zhan et al., 31 Mar 2026).
- Retrieval-Augmented Generation (RAG): Incorporates external biomedical corpora (e.g., MedRAG) via embedding-based retrieval, prepending top-ranked evidence snippets to the model context (Correa-Guillén et al., 19 Nov 2025).
- Probability-Based Selection: Computes the generation likelihood for each answer option, selecting the one with maximal log-probability under the model (Correa-Guillén et al., 19 Nov 2025).
- Selective Chain-of-Thought: Dynamically decides (via a learned or prompted binary classifier f_sel) whether to invoke stepwise reasoning based on whether a question demands it, reducing unnecessary rationale generation for recall-type items while focusing effort on complex cases (Zhan et al., 23 Feb 2026).
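Probability-based selection, as described in the list above, reduces to an argmax over per-option log-probabilities. The sketch below assumes the per-token log-probabilities have already been obtained from a model; the function names and the optional length normalization are illustrative choices, not details confirmed by the cited work.

```python
from typing import Dict, List


def sequence_logprob(token_logprobs: List[float], length_normalize: bool = False) -> float:
    """Sum per-token log-probabilities, optionally averaging over tokens."""
    if not token_logprobs:
        raise ValueError("an option needs at least one token log-probability")
    total = sum(token_logprobs)
    return total / len(token_logprobs) if length_normalize else total


def select_option(option_token_logprobs: Dict[str, List[float]],
                  length_normalize: bool = False) -> str:
    """Return the label of the option with the highest log-probability."""
    scores = {
        label: sequence_logprob(lps, length_normalize)
        for label, lps in option_token_logprobs.items()
    }
    return max(scores, key=scores.get)


# Made-up log-probabilities, for illustration only.
example = {
    "1": [-0.9, -1.2],
    "2": [-0.3, -0.4, -0.5],
    "3": [-1.0],
    "4": [-1.5, -0.2],
}
print(select_option(example))                         # raw sum favours the short option "3"
print(select_option(example, length_normalize=True))  # per-token average favours "2"
```

The example shows why normalization is a design decision worth stating: an unnormalized sum favours short options, whereas a per-token average does not.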
Below is a summary of key HeadQA benchmarking results for state-of-the-art LLMs (English/Spanish):
| Model | Method | Accuracy (%) | Source |
|---|---|---|---|
| Llama-3.1-70B | Prompting | ~83 | (Correa-Guillén et al., 19 Nov 2025) |
| Mixtral-8×7B | Prompting | ~70 | (Correa-Guillén et al., 19 Nov 2025) |
| Mistral-7B | Prompting | ~60 | (Correa-Guillén et al., 19 Nov 2025) |
| GPT-4o | CoT (test split) | 91.39 | (Zhan et al., 31 Mar 2026) |
| GPT-4o-mini | CoT (test split) | 79.10 | (Zhan et al., 31 Mar 2026) |
Human physicians in the original HeadQA v1 scored approximately five times higher in point-based metrics than early neural and IR-based systems, underscoring the difficulty of the task (Vilares et al., 2019).
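The accuracy, point-based, and unanswered-ratio metrics mentioned in this section can be computed along the following lines. This is a sketch under a stated assumption: each correct answer earns 3 points and each wrong answer costs 1 point, with unanswered items scoring zero. These weights are exposed as parameters and should be checked against the exam rules in use; the function name `exam_metrics` is hypothetical.

```python
from typing import Dict, List, Optional


def exam_metrics(predictions: List[Optional[int]], gold: List[int],
                 right_points: int = 3, wrong_penalty: int = 1) -> Dict[str, float]:
    """Compute accuracy, exam points, and unanswered ratio.

    A prediction of None means the system abstained on that question.
    The +3 / -1 defaults are an assumption, not taken from the text above.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(1 for p, g in zip(predictions, gold) if p is not None and p == g)
    unanswered = sum(1 for p in predictions if p is None)
    wrong = len(gold) - correct - unanswered
    return {
        "accuracy": correct / len(gold),
        "points": right_points * correct - wrong_penalty * wrong,
        "unanswered_ratio": unanswered / len(gold),
    }


# Four questions: two correct, one wrong, one left unanswered.
print(exam_metrics([1, 3, None, 2], [1, 2, 4, 2]))
```

Because wrong answers are penalized, a system that abstains on uncertain items can score more points than one with the same accuracy that always answers.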
4. Model Performance, Error Analysis, and Self-Reflection
Despite notable gains from contemporary LLMs, the HeadQA benchmark exposes persistent methodological challenges:
- Model-Scale Dominance: LLMs with substantially larger parameter counts (e.g., Llama-3.1-70B, GPT-4o) consistently surpass smaller models, both in accuracy and robustness (Correa-Guillén et al., 19 Nov 2025, Zhan et al., 31 Mar 2026).
- Inference Complexity: Increased architectural or prompting complexity (e.g., RAG, CoT, probability-based selection) yields marginal or even negative returns relative to simple zero/few-shot prompting in high-capacity models (Correa-Guillén et al., 19 Nov 2025).
- Self-Reflective Reasoning: Iterative self-critique (reflection loops) fails to yield consistent accuracy improvements on HeadQA. In one study, GPT-4o’s accuracy dropped slightly from 91.39% (CoT) to 90.57% (after up to 10 reflection steps), while GPT-4o-mini remained unchanged at 79.10% (Zhan et al., 31 Mar 2026). Most chains of reasoning terminated after zero or one reflection, and extended iterations often led to confirmation bias or spurious error introduction rather than reliable correction. This reveals a gap between reasoning transparency and correctness.
The two dominant failure modes are persistence of initial logical errors and over-rationalization ("confirmation bias"); as a result, reflection rarely flips a wrong answer to the correct one. Combining reflection with external knowledge retrieval, or applying stepwise critique to individual reasoning substages, is proposed as a direction for improvement (Zhan et al., 31 Mar 2026).
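A bounded reflection loop of the kind discussed in this section can be sketched as follows. The `critique` callable is a hypothetical stand-in for a model call that either returns a revised answer or `None` when it is satisfied. The cap of 10 steps mirrors the figure quoted above, but the control flow is an illustrative assumption rather than the cited study's implementation.

```python
from typing import Callable, List, Optional, Tuple

# (question, answer history) -> revised answer, or None to stop reflecting
CritiqueFn = Callable[[str, List[str]], Optional[str]]


def answer_with_reflection(question: str, initial_answer: str,
                           critique: CritiqueFn, max_steps: int = 10) -> Tuple[str, int]:
    """Apply up to max_steps reflection rounds; return (final answer, rounds used)."""
    history = [initial_answer]
    for _ in range(max_steps):
        revised = critique(question, history)
        if revised is None:  # the critic accepts the current answer
            break
        history.append(revised)
    return history[-1], len(history) - 1


def one_shot_critic(question: str, history: List[str]) -> Optional[str]:
    """Toy critic: revises "2" to "3" once, then accepts the answer."""
    return "3" if history[-1] == "2" else None


def stubborn_critic(question: str, history: List[str]) -> Optional[str]:
    """Toy critic: keeps restating the current answer and never accepts it."""
    return history[-1]


print(answer_with_reflection("q", "2", one_shot_critic))
print(answer_with_reflection("q", "2", stubborn_critic))
```

The second toy critic illustrates the persistence failure mode: every round is spent restating the initial answer, so the step budget is exhausted without any correction.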
5. Innovations in Efficient Reasoning
Selective Chain-of-Thought (Selective CoT), as investigated on HeadQA, implements a dynamic control mechanism: the model first predicts if detailed reasoning is needed and proceeds with CoT only when beneficial (Zhan et al., 23 Feb 2026). This approach yields the following:
- For Qwen-2.5-7B, Selective CoT both improved accuracy (+8.7%) and reduced output tokens (–19%) and inference latency (–17.6%) compared to always generating CoT.
- For Llama-3.1-8B, the method decreased accuracy (–4.05%) but preserved the compute savings.
- Selective CoT aligns well with the significant share (~40–50%) of recall-based items in HeadQA, removing redundant token generation on already ‘known’ questions.
No fine-tuning or external classifier is required to deploy Selective CoT; a prompt-engineered “¿Sí/No?” classifier suffices (Zhan et al., 23 Feb 2026).
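The prompt-based gating described in this section can be sketched as a two-call routine: one call decides whether reasoning is needed, and a second answers either directly or with CoT. The prompt wording, the `llm` callable, and the `fake_llm` stub below are all hypothetical and exist only to make the control flow concrete.

```python
from typing import Callable, Tuple

# Hypothetical prompt templates; the wording used in the cited work may differ.
GATE_PROMPT = ("Does answering the following question require step-by-step "
               "reasoning? Reply Yes or No.\n\n{question}")
DIRECT_PROMPT = "Answer with the number of the correct option only.\n\n{question}"
COT_PROMPT = "Think step by step, then give the number of the correct option.\n\n{question}"


def needs_reasoning(gate_reply: str) -> bool:
    """Interpret the gate's reply; accepts English or Spanish affirmatives."""
    return gate_reply.strip().lower().startswith(("yes", "sí", "si"))


def selective_cot(question: str, llm: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (model output, whether CoT was used)."""
    use_cot = needs_reasoning(llm(GATE_PROMPT.format(question=question)))
    template = COT_PROMPT if use_cot else DIRECT_PROMPT
    return llm(template.format(question=question)), use_cot


def fake_llm(prompt: str) -> str:
    """Stub model: asks for reasoning only on dose calculations."""
    if prompt.startswith("Does answering"):
        return "Yes" if "mg/kg" in prompt else "No"
    return "CoT answer" if prompt.startswith("Think step by step") else "Direct answer"


print(selective_cot("Which vitamin deficiency causes scurvy?", fake_llm))
print(selective_cot("A 20 kg child is prescribed 15 mg/kg. What is the dose?", fake_llm))
```

The gate adds one short model call per question, so the savings depend on how many items are routed to the direct path and on how long the skipped rationales would have been.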
6. Specialized Architectures and Interpretability
Traditional IR and neural comprehension models underperform on HeadQA relative to humans, achieving ~33–37% accuracy and struggling with domain mismatch, retrieval failures, and multi-hop reasoning (Vilares et al., 2019). The MurKe framework exemplifies an interpretable, multi-step knowledge extraction and reasoning approach:
- Knowledge Extraction: Retrieves and prunes candidate documents based on token-level and BioBERT-based semantic relevance.
- Iterative Reasoning: Maintains a latent “question” state, selecting documents, reformulating the query, and scoring entailment at each step.
- Interpretability: Key evidence tokens contributing to decisions are explicit at every reasoning step; the full chain from document selection through final answer is transparent.
- Empirical Improvements: MurKe achieves 46.7% accuracy and 199.8 points, outperforming prior multi-hop and entailment baselines by ~4 points in exam-style scoring (Liu et al., 2020).
A plausible implication is that future HeadQA architectures may benefit from combining such interpretable, iterative frameworks with large-scale pretraining to further narrow the human–machine performance gap.
7. Future Directions and Open Challenges
HeadQA remains a persistent frontier for AI research in biomedical reasoning:
- Model Adaptation: Fine-tuning on HeadQA and leveraging domain-specific pretraining data in Spanish are promising routes to further advances (Correa-Guillén et al., 19 Nov 2025).
- Improved External Knowledge Integration: Retrieval-augmented methods currently yield inconsistent benefits; new grounding or counterfactual approaches could support genuine self-correction (Correa-Guillén et al., 19 Nov 2025, Zhan et al., 31 Mar 2026).
- Multimodal Expansion: Approximately 2–3% of questions are image-based; multimodal models that incorporate visual reasoning are an open research direction (Vilares et al., 2019).
- Robust Evaluation: Standardizing data splits (HeadQA v1/v2), reporting statistical significance, and extending to open-answer variants and generative evaluation metrics (e.g., BLEU, exact match) are highlighted gaps (Correa-Guillén et al., 19 Nov 2025).
- Language Transfer: Machine-translated benchmarks in English, Italian, Galician, and Russian support cross-lingual studies and generalizability assessments.
HeadQA thus provides not only a linguistically and conceptually rich benchmark for evaluating emergent reasoning capabilities in LLMs, but also a rigorous testbed for systematic improvements in clinical QA methodologies (Correa-Guillén et al., 19 Nov 2025, Zhan et al., 31 Mar 2026, Vilares et al., 2019, Liu et al., 2020).