
MEDIQA-OE 2025 Shared Task

Updated 19 October 2025
  • MEDIQA-OE 2025 is a shared task that evaluates the extraction of structured medical orders from long clinical dialogues, approached here with advanced prompt engineering.
  • It leverages instruction-tuned large language models to convert unstructured patient-provider exchanges into standardized data fields including order type, description, reason, and provenance.
  • Comparative analysis and evaluation metrics highlight performance trade-offs and the critical role of schema-guided prompt design in improving extraction accuracy.

The MEDIQA-OE 2025 Shared Task is an evaluation challenge centered on the automatic extraction of structured medical orders from extended conversational clinical transcripts. As part of the broader MEDIQA initiative, the 2025 shared task advances methodologies for transforming unstructured multi-turn dialogues—including patient-provider exchanges—into actionable, structured clinical data for downstream applications such as decision support, documentation, and workflow automation.

1. Task Definition and Objectives

The shared task focuses on extracting medical orders, which may appear in heterogeneous sources such as electronic health records, discharge summaries, and lengthy doctor–patient dialogues. Each medical order is to be decomposed into four distinct fields:

  • Order Type: Categorical (medication, lab, imaging, followup)
  • Description: Textual phrase specifying the order
  • Reason: Clinical justification or motivation
  • Provenance: Set of turn identifiers (e.g., utterance indices) where the supporting evidence is found

The formal prediction setup is:

\mathcal{D} = \{(t_i, s_i, u_i)\}_{i=1}^{N}, \quad \mathcal{O} = \{o_j\}_{j=1}^{M}

where \mathcal{D} denotes the dialogue of N turns and \mathcal{O} the set of M extracted orders, with each order

o_j = (\text{type}_j,\, \text{desc}_j,\, \text{reason}_j,\, \text{prov}_j)

and \text{prov}_j \subseteq \{1, 2, \ldots, N\}.

The core objective is the robust, accurate extraction of these fields across long-context dialogues without access to extensive domain adaptation.
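
To make the schema concrete, one extracted order o_j can be represented as a small data structure. The following Python sketch uses hypothetical field values and is only an illustration of the four-field format, not part of the official task tooling.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical container for one extracted order o_j; field names mirror the
# four-field schema above. Example values are invented for illustration.
@dataclass
class MedicalOrder:
    order_type: str                 # "medication", "lab", "imaging", or "followup"
    description: str                # textual phrase specifying the order
    reason: str                     # clinical justification or motivation
    provenance: List[int] = field(default_factory=list)  # supporting turn identifiers

example_order = MedicalOrder(
    order_type="lab",
    description="complete blood count",
    reason="evaluate persistent fatigue",
    provenance=[12, 14],
)
```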

2. Model Architectures and Implementation Strategies

The challenge featured a range of participant models; the MasonNLP submission exemplifies a lightweight yet effective approach:

  • Base Model: Meta’s LLaMA-4 Scout 17B, an instruction-tuned LLM with open weights and an 8,192-token context window. No domain-specific fine-tuning was applied.
  • Learning Setup: Few-shot configuration with a single in-context example, leveraging the LLM’s instruction adherence and context mechanisms.
  • Comparison Systems: Additional runs included LLaMA-3 8B in zero-shot mode; the larger, instruction-tuned LLaMA-4 Scout 17B yielded clear F1 improvements in both zero-shot and few-shot configurations.

A plausible implication is that instruction tuning and effective prompt design compensate for lack of clinical-domain data, as shown by substantial gains over zero-shot smaller models.
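
A minimal sketch of such a few-shot setup is shown below, assuming the Hugging Face transformers text-generation pipeline and an openly released Llama-4 Scout instruct checkpoint; the checkpoint identifier, prompt wording, and exemplar are illustrative assumptions rather than the exact MasonNLP configuration.

```python
from transformers import pipeline

# Sketch of the few-shot setup: a single in-context example plus the target
# dialogue are sent to an instruction-tuned LLM with no domain fine-tuning.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed checkpoint id
    device_map="auto",
)

exemplar = (
    "[1] Doctor: Let's order a chest X-ray to rule out pneumonia.\n"
    "Orders: imaging, chest X-ray, rule out pneumonia, 1"
)
dialogue = (
    "[1] Doctor: Your cough has lasted three weeks.\n"
    "[2] Doctor: I'll order a chest X-ray to check for infection."
)

# Prompt wording is an illustrative assumption, not the published prompt.
prompt = (
    "You are a clinical assistant. Extract all medical orders from the dialogue "
    "as comma-separated lines: order_type, description, reason, provenance.\n\n"
    f"Example:\n{exemplar}\n\nDialogue:\n{dialogue}\n\nOrders:\n"
)

result = generator(prompt, max_new_tokens=256, do_sample=False)
print(result[0]["generated_text"])
```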

3. Prompt Engineering and Pipeline Design

The efficacy of the extraction model was strongly linked to prompt construction and iterative refinement:

  • Role Instruction: The prompt assigns the model a clinical assistant role responsible for four-field medical order extraction.
  • Input Formatting: Dialogue is rendered line-wise, with turn IDs and speakers preserved: [turn_id] Speaker: Utterance.
  • Output Schema: Initial JSON output attempts proved inconsistent; the final schema is a comma-separated line format that enforces clean field separation and order linkage.
  • Exemplar Inclusion: The few-shot configuration provides a single annotated exchange, reinforcing both schema adherence and the act of provenance citation.

Iterative prompt development enhanced the precision of the extracted fields—particularly for provenance mapping and reason assignment, which are challenging due to implicit references and evidence scattered across multiple dialogue turns.
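
The input formatting and line-based output schema described above can be sketched as follows; the delimiter handling and the space-separated encoding of provenance turn IDs are assumptions for illustration, not the exact MasonNLP implementation.

```python
import csv
from io import StringIO

def format_dialogue(turns):
    """Render each turn as '[turn_id] Speaker: Utterance'."""
    return "\n".join(f"[{tid}] {speaker}: {utterance}" for tid, speaker, utterance in turns)

def parse_orders(model_output):
    """Parse comma-separated order lines into four-field dictionaries.
    Provenance is assumed to be space-separated turn IDs in the fourth field."""
    orders = []
    for row in csv.reader(StringIO(model_output.strip())):
        if len(row) < 4:
            continue  # skip malformed lines
        order_type, description, reason, provenance = (c.strip() for c in row[:4])
        orders.append({
            "order_type": order_type,
            "description": description,
            "reason": reason,
            "provenance": [int(t) for t in provenance.split() if t.isdigit()],
        })
    return orders

turns = [(1, "Doctor", "Let's start you on lisinopril for your blood pressure."),
         (2, "Patient", "Okay, that sounds fine.")]
print(format_dialogue(turns))
print(parse_orders("medication, lisinopril, elevated blood pressure, 1"))
```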

4. Evaluation Metrics and Results

Performance was evaluated using field-specific metrics, averaged for overall system ranking:

| Field | Metric | MasonNLP Few-Shot Score |
|---|---|---|
| Description | ROUGE-1 F1 | 39.05% |
| Reason | ROUGE-1 F1 | 19.78% |
| Order Type | Strict F1 | 50.91% |
| Provenance | MultiLabel F1 | 41.32% |
| Average | Mean | 37.76% |
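
As one concrete reading of the provenance metric, a set-based F1 over turn identifiers could be computed as in the sketch below; this is an assumption about the scoring convention, and the official scorer may differ in detail.

```python
def provenance_f1(predicted, gold):
    """Set-based F1 over turn identifiers; one plausible reading of the
    multi-label provenance metric (assumption, not the official scorer)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(provenance_f1([12, 14, 15], [12, 14]))  # 0.8
```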

The final score is given by:

\text{Score} = \frac{\text{F1}_{\text{desc}} + \text{F1}_{\text{reason}} + \text{F1}_{\text{type}} + \text{F1}_{\text{prov}}}{4}
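
Substituting the MasonNLP few-shot field scores from the table above gives

\text{Score} = \frac{39.05 + 19.78 + 50.91 + 41.32}{4} = \frac{151.06}{4} = 37.765\%

which matches the reported overall score of 37.76% up to rounding of the individual field values.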

An ablation across model size and prompt strategy confirmed that few-shot prompting with a larger LLM delivered tangible gains, especially in provenance detection.

Competitor systems achieved higher scores (e.g., WangLab and silver_shaw at ~60%), suggesting room for improvement in capturing implicit reasoning and multi-turn evidence chains.

5. Comparative Analysis of Approaches

The MasonNLP pipeline foregrounds the strategy of using a general-purpose, instruction-tuned LLM without fine-tuning:

  • Advantages: This approach delivers competitive results while minimizing resource requirements and the need for large domain-specific datasets.
  • Limitations: Accuracy in extracting the reason and comprehensive provenance fields lags behind specialized models, reflecting the difficulty of modeling implicit, distributed clinical rationale within conversational data.
  • Potential Enhancements: Incorporation of retrieval augmentation (e.g., RAG) and targeted domain adaptation could address these weaknesses by improving evidence aggregation and context tracking.

The variation in results among participating teams illustrates the trade-offs between model specialization and generalization, and the critical importance of schema grounding in prompt engineering for clinical NLP tasks.

6. Implications and Future Research Directions

Key findings and implications for clinical NLP include:

  • General-Purpose LLM Viability: Large, instruction-tuned models such as LLaMA-4 can effectively serve as scalable baselines for specialized extraction when combined with structured input formats and minimal exemplar guidance.
  • Context Window Utility: The ability to process 8,192 tokens enables handling extended clinical dialogues with multiple orders and long chains of reasoning.
  • Prompt Engineering Impact: Schema-constrained prompt design, coupled with exemplification, is essential for high-fidelity extraction in the absence of domain-specific supervision.
  • Research Gaps: Future directions should focus on enriching the model’s ability to parse and aggregate implicit clinical reasoning, as well as robust, multi-turn evidence collection.
  • Adoption Barriers: The need for precise provenance and justification extraction remains a critical bottleneck; ongoing research may integrate retrieval-based models and domain-adaptive training to close this gap.

The MEDIQA-OE 2025 Shared Task demonstrates that prompt-guided, large general-purpose LLMs are practical baselines for conversational medical order extraction, but also delineates the challenges that remain, particularly in the areas of implicit reasoning and provenance mapping (Karim et al., 12 Oct 2025).
