HEAD-QA v2 Benchmark
- HEAD-QA v2 is a comprehensive, multilingual benchmark that compiles 12,751 exam-based questions from Spanish healthcare specialization exams.
- It employs advanced prompting strategies, retrieval augmentation, and probability-based answer selection to assess biomedical reasoning in LLMs.
- Quantitative evaluations reveal a strong model scale effect, with LLaMA 3.1 70B achieving up to 83% zero-shot accuracy in complex healthcare QA tasks.
HEAD-QA v2 is a large-scale, multilingual benchmark for evaluating complex biomedical reasoning in LLMs. Building on the initial HEAD-QA release, HEAD-QA v2 expands the dataset size, temporal scope, and linguistic diversity, offering a robust resource for research on automatic healthcare question answering and model development in both Spanish and English, as well as additional languages.
1. Dataset Composition and Expansion
HEAD-QA v2 comprises 12,751 multiple-choice questions sourced from ten years (2013–2022) of Spanish professional healthcare specialization exams. The dataset draws directly from publicly available PDFs and answer keys published by the Ministerio de Sanidad de España. Compared to HEAD-QA v1, which contained 6,765 questions from 2013–2017, HEAD-QA v2 more than doubles the question count and extends the covered period to a decade.
Each question is stored in columnar Parquet format and can be represented as a JSON-like record with the following fields (an illustrative record follows the list):
- Unique exam identifier and question ID
- Question stem
- List of labeled answer options (four or five, depending on year)
- Correct answer label
- Year and discipline (Medicine – MIR, Nursing – EIR, Biology – BIR, Chemistry – QIR, Psychology – PIR, Pharmacy – FIR)
- Any linked image (334 questions have associated diagrams)
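For illustration, a single record might look as follows; the field names and values are hypothetical placeholders chosen to mirror the fields above, not the dataset's actual schema.

```python
# Hypothetical question record mirroring the fields listed above; names and values
# are illustrative placeholders, not the released schema.
example_record = {
    "exam_id": "MIR_2021",          # exam identifier (discipline + year)
    "qid": 42,                      # question ID within the exam
    "question": "Placeholder question stem?",
    "options": {                    # four or five labeled options, depending on year
        "1": "Option A",
        "2": "Option B",
        "3": "Option C",
        "4": "Option D",
    },
    "answer": "3",                  # correct answer label
    "year": 2021,
    "category": "medicine",         # MIR
    "image": None,                  # link to a diagram for the 334 image questions
}
```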
Chemical formulas in questions are converted to SMILES notation for compatibility with text-only LLM pipelines. Correct-answer positions are roughly uniformly distributed, with a slight edge-avoidance bias arising from exam design. Approximately 21% of questions (those from the 2013–2014 exams) have five answer options; the remainder have four.
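A minimal sketch of loading and sanity-checking one split with pandas; the file path and column names ("answer", "options") are assumptions, not the released schema.

```python
import pandas as pd

# Load one split of the benchmark; path and column names are assumed for illustration.
df = pd.read_parquet("head_qa_v2_es.parquet")

# Correct-answer position distribution (expected to be close to uniform,
# with a slight edge-avoidance bias by exam design).
print(df["answer"].value_counts(normalize=True).sort_index())

# Share of five-option questions (the 2013–2014 exams, roughly 21% of the data).
print((df["options"].apply(len) == 5).mean())
```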
2. Multilingual Coverage and Translation Pipeline
HEAD-QA v2 provides validated translations for English, Italian, Galician, and Russian. English translations are generated via zero- and few-shot prompting with an instruction-tuned LLaMA-3.1-8B model. Automated validation ensures integrity in format, answer count, and numbering, with selection based on a proxy back-translation metric that incorporates BLEU and BERTScore on a held-out sample.
The pipeline achieves the highest round-trip BLEU scores and semantic correlation (BERTScore-F1) for Italian and Galician (BLEU: 0.57–0.66; BERTScore-F1: 0.77–0.80), followed by English (BLEU: 0.41; BERTScore-F1: 0.69) and Russian (BLEU: 0.33; BERTScore-F1: 0.65), confirming fidelity across languages. This enables broad cross-lingual evaluation and facilitates analysis of linguistic effects on biomedical reasoning.
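A minimal sketch of such a round-trip check is shown below, assuming a `translate` callable that wraps the LLaMA-3.1-8B prompting step (not shown) and using sacrebleu and bert-score; the held-out sampling and format-validation rules of the actual pipeline are omitted.

```python
import sacrebleu
from bert_score import score as bert_score

def round_trip_quality(sources, translate, src="es", tgt="en"):
    """Proxy back-translation metric: translate src -> tgt -> src and compare the
    result to the originals. `translate(text, src, tgt)` stands in for the LLM step."""
    forward = [translate(s, src, tgt) for s in sources]
    back = [translate(t, tgt, src) for t in forward]

    # Corpus-level BLEU, reported on the 0-1 scale used above.
    bleu = sacrebleu.corpus_bleu(back, [sources]).score / 100.0
    # Semantic similarity via BERTScore F1 against the original sources.
    _, _, f1 = bert_score(back, sources, lang=src)
    return bleu, f1.mean().item()
```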
3. Benchmarking Protocols and Evaluation Metrics
To assess dataset difficulty and establish baseline performance, HEAD-QA v2 benchmarks four open-access instruction-tuned LLMs under three distinct inference paradigms:
- Models and Scales: LLaMA 3.1 (8B, 70B), Mistral v0.3 (7B), Mixtral v0.1 (Mixture-of-Experts, 8×7B)
- Prompting Strategies:
- Zero-shot: System/user instructions requesting a JSON-formatted answer, e.g., { Answer: 3 }
- Few-shot: Augmented with three USMLE-style in-domain examples
- Chain-of-Thought (CoT): Explicit instruction to reason over the answer options before producing the final answer
- Retrieval-Augmented Generation (RAG; a retrieval sketch follows this list):
- External corpus: 18 USMLE textbooks segmented into 126,000 passages (<1,000 characters each)
- Retrieval: MedCPT dual-encoder (768-dim vectors)
- Search and Indexing: FAISS flat index for recall; top two passages prepended to prompt
- Probability-Based Answer Selection:
- For each answer $A_i$, the conditional likelihood is computed as the geometric mean of its token probabilities, evaluated in log space for numerical stability (a scoring sketch follows this list):
$P(A_i) = \left(\prod_{j=1}^{m} q_j \right)^{1/m}, \quad \text{where } q_j = P(\text{token } a_j \mid \text{question}, a_1, \dots, a_{j-1}), \ m = \text{number of answer tokens}$
- Selecting the answer with maximal $P(A_i)$ ensures zero unanswered questions
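The retrieval step of the RAG setup can be sketched as follows, assuming the publicly released MedCPT encoders on Hugging Face (ncbi/MedCPT-Query-Encoder, ncbi/MedCPT-Article-Encoder) and an exact inner-product FAISS index; the passage corpus shown is a placeholder, not the actual USMLE textbook segmentation.

```python
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

# Dual-encoder retrieval over textbook passages (768-dim embeddings, flat index).
query_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
query_enc = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")
doc_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")
doc_enc = AutoModel.from_pretrained("ncbi/MedCPT-Article-Encoder")

def embed(texts, tokenizer, model):
    batch = tokenizer(texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**batch).last_hidden_state[:, 0].numpy()  # [CLS] embeddings

passages = ["...USMLE textbook passage...", "...another passage..."]  # placeholder corpus
index = faiss.IndexFlatIP(768)                  # exact (flat) search; inner product assumed
index.add(embed(passages, doc_tok, doc_enc))

def retrieve(question, k=2):
    """Return the top-k passages to prepend to the question prompt."""
    _, ids = index.search(embed([question], query_tok, query_enc), k)
    return [passages[i] for i in ids[0]]
```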
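The probability-based selection rule can likewise be sketched with a Hugging Face causal LM; model loading, prompt formatting, and batching are omitted, and the scoring below simply implements the geometric-mean likelihood defined above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def answer_logprob(model, tokenizer, question: str, answer: str) -> float:
    """Length-normalized log-likelihood of `answer` given `question`: the mean of
    token log-probabilities, i.e. the log of the geometric mean P(A_i)."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)          # predicts tokens 1..T-1
    positions = torch.arange(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    token_logps = log_probs[0, positions].gather(1, answer_ids[0].unsqueeze(-1)).squeeze(-1)
    return token_logps.mean().item()

def select_answer(model, tokenizer, question: str, options: list[str]) -> int:
    """Pick the option with maximal geometric-mean likelihood; never abstains."""
    scores = [answer_logprob(model, tokenizer, question, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)
```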
Performance is quantified via the following metrics (a computation sketch follows the list):
- Accuracy: fraction of questions answered correctly
- Normalized exam score: applies the exam penalty scheme, in which three wrong answers cancel one correct answer, then rescales the result for comparability
- Unanswered ratio: fraction of questions left without a valid response
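A sketch of these metrics; the exact rescaling of the penalized score used by the benchmark is an assumption (here a perfect exam maps to 100).

```python
def exam_metrics(outcomes):
    """`outcomes` is a list of 'correct', 'wrong', or 'blank' per question.
    The rescaling of the penalized score is assumed (perfect exam -> 100)."""
    n = len(outcomes)
    correct = outcomes.count("correct")
    wrong = outcomes.count("wrong")
    blank = outcomes.count("blank")
    raw = correct - wrong / 3.0          # three wrong answers cancel one correct
    return {
        "accuracy": correct / n,
        "normalized_score": max(raw, 0.0) / n * 100.0,
        "unanswered_ratio": blank / n,
    }
```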
4. Quantitative Outcomes and Analysis
On the English subset, LLaMA 3.1 70B achieves the top zero-shot accuracy (83.15%; normalized score 84.16%), with similar results under few-shot (82.90%; 84.14%) and CoT (82.54%; 84.20%) prompting. The 8B version attains around 70% accuracy (zero-shot: 70.43%), with non-response rates below 1%. Mixtral (8×7B) and Mistral (7B) yield 59–70% accuracy, with Mixtral outperforming Mistral by 10–12 points.
RAG does not yield systematic improvement: LLaMA 3.1 70B achieves 82.45% accuracy with RAG, on par with prompting alone, while smaller models see 1–3 point reductions. The correlation between retrieval relevance and answer correctness (r = 0.07) is negligible, and probability-based answer selection underperforms free-text generation, reaching a maximum accuracy of 54.15% with LLaMA 3.1 70B, nearly 30 points below zero-shot prompting.
Translation into English consistently elevates performance, notably for models with smaller parameter counts, reflecting both the utility of multilingual data and heightened LLM proficiency in English.
5. Key Insights and Limitations
Analysis reveals that model scale and intrinsic reasoning ability dominate performance, evidenced by accuracy gains exceeding 10 points when moving from 8B to 70B within the same architecture. Advanced prompting schemes (CoT) and retrieval-based methods (RAG) yield minimal or negative returns for this biomedical reasoning task. Probability-based answer selection provides computational efficiency but falls short on accuracy.
Constraints include exclusion of proprietary LLMs (GPT-4, Claude, Gemini) due to API costs, absence of comprehensive human validation for non-English translations, and focus on multiple-choice questions reflecting professional exams but omitting broader clinical reasoning contexts.
6. Future Research Prospects
HEAD-QA v2 establishes opportunities for:
- Fine-tuning and instruction-tuning models specifically for biomedical QA
- Designing enhanced retrieval pipelines or leveraging domain-specific corpora
- Systematic human evaluation of translations and question difficulty
- Expansion to open-ended, multi-answer formats and multimodal reasoning with image-driven questions
The dataset’s increased temporal and linguistic breadth supports investigations into temporal drift, generalization to recent clinical contexts, and robust cross-lingual and cross-disciplinary assessment.
7. Significance and Research Impact
HEAD-QA v2 provides a reliable, temporally extended, and multilingual benchmark for advancing automated biomedical reasoning and LLM model improvement. By thoroughly characterizing LLM performance across scales, inference paradigms, and languages, it informs model selection strategies and catalyzes further research in specialized healthcare QA domains. The resource also facilitates systematic benchmarking and model comparison, serving as a cornerstone for future innovations in clinical reasoning automation (Correa-Guillén et al., 19 Nov 2025).