MEDIQA-OE-2025: Clinical NLP Benchmark
- The MEDIQA-OE-2025 Shared Task is a benchmark challenge for extracting actionable clinical orders from multi-turn doctor–patient conversations to support automated EHR population.
- It employs a strict JSON schema to classify orders into medication, lab, imaging, or follow-up, capturing key details and provenance from dialogue turns.
- Performance is evaluated with field-specific F₁ metrics for order type, description, reason, and provenance; results highlight the benefits of careful prompt engineering and model scaling.
The MEDIQA-OE-2025 Shared Task is a benchmark challenge in clinical NLP focused on extracting structured medical orders from multi-turn, doctor–patient conversational transcripts. Its goal is to enable automated, high-fidelity population of electronic health records (EHRs) directly from spoken clinical dialogue, thus reducing clinician documentation burden and supporting downstream applications such as workflow automation and decision support (Karim et al., 12 Oct 2025, Corbeil et al., 30 Oct 2025, Balachandran et al., 13 Nov 2025).
1. Task Definition and Motivation
The MEDIQA-OE-2025 shared task targets the automatic extraction of actionable medical orders from longitudinal doctor–patient conversations. The central objectives are to:
- Identify every medical order in the transcript.
- Classify the order as one of: medication, laboratory test, imaging study, or follow-up action.
- Extract the clinical description of the order (as uttered by the physician).
- Capture the physician’s reason or justification (optional but preferred).
- Ground each order to provenance, specifically the turn IDs in the transcript supporting the extraction.
The output is a JSON-formatted list of orders, each represented as a 4-tuple:
- order_type ∈ {medication, lab, imaging, followup}
- description (≤20 words, physician’s phrasing)
- reason (clinical justification, ≤20 words, or null if not present)
- provenance (list of supporting turn IDs)
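For concreteness, here is a minimal sketch of a single extracted order under this schema; the dialogue content, medication, and turn IDs are invented for illustration.

```python
import json

# Hypothetical example of one extracted order in the MEDIQA-OE-2025 schema.
# The dialogue content and turn IDs are illustrative, not taken from the dataset.
order = {
    "order_type": "medication",                           # one of: medication, lab, imaging, followup
    "description": "start lisinopril 10 mg once daily",   # <=20 words, physician's phrasing
    "reason": "blood pressure remains elevated",          # <=20 words, or None if not stated
    "provenance": [12, 14],                                # turn IDs supporting the extraction
}

# Systems submit a JSON list of such orders per conversation.
print(json.dumps([order], indent=2))
```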
Orders that are no longer active and are not explicitly renewed in the conversation must be excluded. This rigor in temporal grounding reflects real EHR integration requirements (Karim et al., 12 Oct 2025, Corbeil et al., 30 Oct 2025).
2. Dataset Construction and Annotation Protocol
Training and evaluation datasets for MEDIQA-OE-2025 were sourced from:
- ACI-Bench: 207 naturalistic clinical consultations (10–30 minutes)
- PriMock57: 57 high-quality simulated primary-care dialogues
Annotation employed medically trained annotators and enforced strict adherence to a four-field JSON schema. Inter-annotator agreement was measured at Cohen’s κ = 0.768. The entire dataset was split as follows:
| Set | Conversations | Orders (Test) | Avg. Orders/Dialog |
|---|---|---|---|
| Training | 64 | — | — |
| Validation | 100 | — | — |
| Test | 100 | 255 | 2.55 |
Approximate type distribution (test set): medications ~40%, labs ~30%, imaging ~20%, follow-up ~10% (Corbeil et al., 30 Oct 2025, Balachandran et al., 13 Nov 2025).
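For reference on how the reported agreement figure is computed, the following is a minimal Cohen’s κ sketch over hypothetical per-order type labels from two annotators; the labels are invented and the organizers’ exact agreement protocol may differ.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each annotator's marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical order-type labels from two annotators (illustrative only).
ann_1 = ["medication", "lab", "imaging", "followup", "medication", "lab"]
ann_2 = ["medication", "lab", "imaging", "medication", "medication", "lab"]
print(f"kappa = {cohens_kappa(ann_1, ann_2):.3f}")
```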
3. Evaluation Metrics and Leaderboard
Performance evaluation utilized a combination of field-specific and macro-averaged metrics:
- Description and Reason: ROUGE-1 F₁ (unigram overlap between prediction and gold).
- Order Type: Strict F₁ requiring exact match among four categories.
- Provenance: Multi-label F₁ over sets of turn IDs.
- Overall leaderboard score: Average of the above four F₁ metrics.
Precision, recall, and F₁ were defined in the standard way:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

(Karim et al., 12 Oct 2025, Corbeil et al., 30 Oct 2025, Balachandran et al., 13 Nov 2025).
The upper bound (“match score”) is the F₁ over order presence itself, irrespective of field values. In the 2025 shared task, field-level performance lagged roughly 15–18 points behind this upper bound, largely because of the difficulty of span recall and free-text summarization.
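The per-order field scoring can be made concrete with a minimal sketch, assuming gold and predicted orders have already been aligned and using a simplified in-line unigram-overlap ROUGE-1 rather than the official scorer; the actual evaluation aggregates field scores across all orders and treats unmatched orders as misses or false positives.

```python
from collections import Counter

def rouge1_f1(pred: str, gold: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and gold."""
    if not pred.strip() and not gold.strip():
        return 1.0  # treat matching empty/null fields as a full match (official null handling may differ)
    p_tok, g_tok = Counter(pred.lower().split()), Counter(gold.lower().split())
    overlap = sum((p_tok & g_tok).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p_tok.values()), overlap / sum(g_tok.values())
    return 2 * precision * recall / (precision + recall)

def set_f1(pred_ids: set, gold_ids: set) -> float:
    """Multi-label F1 over sets of provenance turn IDs."""
    tp = len(pred_ids & gold_ids)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_ids), tp / len(gold_ids)
    return 2 * precision * recall / (precision + recall)

def order_score(pred: dict, gold: dict) -> float:
    """Average of the four field scores for a single aligned (pred, gold) order pair."""
    type_f1 = 1.0 if pred["order_type"] == gold["order_type"] else 0.0
    desc_f1 = rouge1_f1(pred["description"], gold["description"])
    reason_f1 = rouge1_f1(pred["reason"] or "", gold["reason"] or "")
    prov_f1 = set_f1(set(pred["provenance"]), set(gold["provenance"]))
    return (type_f1 + desc_f1 + reason_f1 + prov_f1) / 4
```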
Selected leaderboard results (macro-average F₁, test set) (Corbeil et al., 30 Oct 2025):
| Team | Model | AVG F₁ (%) |
|---|---|---|
| WangLab | GPT-4 | 60.2 |
| silver_shaw | Gemini 2.5 Pro | 60.1 |
| MISo KeaneBeanz | Qwen3 32B | 53.4 |
| EXL Health AI Lab | MedGemma 27B | 50.9 |
| MasonNLP | LLaMA-4 17B | 37.8 |
| HerTrials | LLaMA3.2 3.2B | 15.9 |
Order-type F₁ approached the match upper bound, indicating high reliability in the categorical subtask when order detection succeeded. However, reason extraction remained comparatively weak, peaking at 41.3% (Corbeil et al., 30 Oct 2025).
4. Methodological Taxonomy and Prompting Strategies
Participants avoided full model fine-tuning due to data scarcity, instead favoring zero- or few-shot prompting with both closed- and open-weight LLMs. Notable strategies included:
- Zero-shot prompting with heavy schema conditioning: Explicitly defining output schema, rules, and JSON-constrained decoding (as in WangLab and silver_shaw with GPT-4 and Gemini 2.5 Pro).
- Few-shot prompting: One or two exemplars included in the prompt to illustrate format and desired mapping (e.g., MISo KeaneBeanz with Qwen3 32B, MasonNLP with LLaMA-4 17B) (Karim et al., 12 Oct 2025).
- Reasoning-augmented prompting: Chain-of-thought, self-verification, and “thinking mode” multi-step prompts to induce more careful extraction and error checking (silver_shaw; EXL Health AI Lab) (Balachandran et al., 13 Nov 2025).
- Agentic workflows: Simulating multi-agent stepwise decomposition into identification, mapping, structuring, validation; found to degrade performance on high-quality, manually annotated data due to propagation of minor errors and increased hallucination rates (Balachandran et al., 13 Nov 2025).
EXL Health AI Lab’s analysis with MedGemma demonstrated that straightforward one-shot prompting outperformed both ReAct and multi-step agentic workflows (Avg F₁: 0.436 vs. 0.277 and 0.111), suggesting simplicity is advantageous on clean, curated transcripts (Balachandran et al., 13 Nov 2025).
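A minimal sketch of the schema-conditioned, one-shot prompting pattern described above, assuming a plain text-prompt interface; the instruction wording and exemplar are illustrative, not any team’s actual prompt, and the LLM call itself is left as a placeholder.

```python
import json

SCHEMA_INSTRUCTIONS = """You are extracting medical orders from a doctor-patient transcript.
Return ONLY a JSON list. Each order must have exactly these fields:
  order_type: one of "medication", "lab", "imaging", "followup"
  description: <=20 words, using the physician's phrasing
  reason: <=20 words of clinical justification, or null if not stated
  provenance: list of turn IDs (integers) supporting the order
Exclude orders that are no longer active and were not explicitly renewed."""

# Illustrative one-shot exemplar (not drawn from the shared-task data).
EXEMPLAR_TRANSCRIPT = "[TURN 7 DOCTOR] Let's get a chest X-ray to rule out pneumonia."
EXEMPLAR_OUTPUT = [{
    "order_type": "imaging",
    "description": "chest X-ray",
    "reason": "rule out pneumonia",
    "provenance": [7],
}]

def build_prompt(transcript_turns: list[str]) -> str:
    """Assemble a schema-conditioned, one-shot extraction prompt."""
    transcript = "\n".join(transcript_turns)
    return (
        f"{SCHEMA_INSTRUCTIONS}\n\n"
        f"Example transcript:\n{EXEMPLAR_TRANSCRIPT}\n"
        f"Example output:\n{json.dumps(EXEMPLAR_OUTPUT)}\n\n"
        f"Transcript:\n{transcript}\n"
        f"Output:"
    )

# The assembled prompt would then be sent to an LLM (API call omitted) and the
# response parsed with json.loads, ideally under JSON-constrained decoding.
```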
5. Model Architectures and Scale Effects
Both closed-weight models (e.g., GPT-4, Gemini 2.5 Pro) and open-weight models (e.g., Qwen3 32B, LLaMA-4 17B, MedGemma 27B) were employed, with no end-to-end clinical fine-tuning except targeted fine-tuning of MedGemma on clinical data for JSON generation (Balachandran et al., 13 Nov 2025). Results indicated:
- Strong positive correlation between model size and extraction performance among open-weight models (Pearson ρ = 0.981).
- MedGemma’s scaling from 4B to 27B parameters showed improved understanding of clinical dialogue structure and specialized language, with domain-specific pretraining enhancing handling of turn markers and speaker cues.
- Prompt design and field-level schema specification contributed critical gains, especially one-shot exemplars for provenance and reason extraction (Karim et al., 12 Oct 2025).
Order extraction, especially reason identification, was not substantially improved by more complex reasoning chains on high-quality data, exposing an “analytical over-processing” phenomenon, where extra cognitive steps can amplify small errors (Balachandran et al., 13 Nov 2025).
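To make the reported size-performance correlation concrete, here is a small sketch computing it from the open-weight entries in the leaderboard table above; the organizers’ exact parameter-count inputs may differ, so the coefficient obtained here is only indicative.

```python
import numpy as np

# Open-weight leaderboard entries (parameter count in billions, AVG F1 in %),
# taken from the table above; sizes are approximate.
sizes  = [32.0, 27.0, 17.0, 3.2]   # Qwen3, MedGemma, LLaMA-4, LLaMA 3.2
scores = [53.4, 50.9, 37.8, 15.9]

# Pearson correlation between model size and extraction performance; with these
# inputs the value is ~0.99, consistent in direction and magnitude with the
# reported rho = 0.981 (the organizers' exact inputs may differ).
rho = np.corrcoef(sizes, scores)[0, 1]
print(f"Pearson rho = {rho:.3f}")
```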
6. Error Analysis and Bottlenecks
Key findings regarding system error profiles included:
- Order counting is a primary bottleneck: the best systems achieved match F₁ ≈ 81.8%, meaning roughly 18% of orders are missed or spuriously predicted.
- Free-text field extraction lags categorical fields: Description, reason, and provenance F₁ were 15–18 points lower than order matching, highlighting challenges in fine-grained span selection and summarization.
- Reason extraction remains the weakest link: Due to implicitness and dispersion of clinical justifications across conversation turns, peak reason F₁ was 41.3% (Corbeil et al., 30 Oct 2025).
- Annotation noise limits ceiling: An inter-annotator agreement of κ = 0.768 sets a practical upper bound on achievable performance.
- Complex prompting introduces error propagation: Multi-step or agentic prompting increases hallucination rates and field violations on well-annotated datasets (Balachandran et al., 13 Nov 2025).
7. Implications, Recommendations, and Future Directions
The MEDIQA-OE-2025 Shared Task establishes that:
- Zero- and few-shot prompting of large LLMs can yield competitive baselines for structured extraction from clinical dialog, but non-trivial performance gaps remain—particularly in free-text fields and order recall.
- Closed-weight LLMs currently lead; open-weight models show clear returns to scale with parameter count.
- Effective prompt engineering (e.g., explicit in-context exemplars, field constraints) can notably boost field fidelity—particularly provenance and reason—without further domain adaptation (Karim et al., 12 Oct 2025, Balachandran et al., 13 Nov 2025).
- Recommendations include expanding annotated corpora (with possible synthetic augmentation), hybridizing span-based NER/RE with generation or retrieval, prompt engineering for small LLMs, and integrating schema validators (a validator sketch follows this list) (Corbeil et al., 30 Oct 2025).
- Future shared tasks may address multilingual and multimodal (audio + transcript) extraction, with a focus on true end-to-end EHR population impact.
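As a concrete instance of the schema-validator recommendation above, here is a minimal post-hoc check of parsed model output against the four-field schema; the specific error messages and any rejection or re-prompting policy are illustrative.

```python
VALID_TYPES = {"medication", "lab", "imaging", "followup"}

def validate_order(order: dict) -> list[str]:
    """Return a list of schema violations for one predicted order (empty list = valid)."""
    errors = []
    if set(order) != {"order_type", "description", "reason", "provenance"}:
        errors.append(f"unexpected or missing fields: {sorted(order)}")
    if order.get("order_type") not in VALID_TYPES:
        errors.append(f"invalid order_type: {order.get('order_type')!r}")
    desc = order.get("description")
    if not isinstance(desc, str) or not desc or len(desc.split()) > 20:
        errors.append("description must be a non-empty string of <=20 words")
    reason = order.get("reason")
    if reason is not None and (not isinstance(reason, str) or len(reason.split()) > 20):
        errors.append("reason must be null or a string of <=20 words")
    prov = order.get("provenance")
    if not isinstance(prov, list) or not all(isinstance(t, int) for t in prov):
        errors.append("provenance must be a list of integer turn IDs")
    return errors

# Orders with violations could be dropped, repaired, or re-prompted for
# before the final JSON list is submitted.
```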
A plausible implication is that, given continued progress in LLM scale, domain adaptation, and prompt design, robust and safe extraction of clinical orders from health dialogues will be attainable, although further advances in annotation rigor, dataset scale, and free-text extractive reasoning are still necessary for reliable deployment (Corbeil et al., 30 Oct 2025, Karim et al., 12 Oct 2025, Balachandran et al., 13 Nov 2025).