
BioPIE: Biomedical Protocol Extraction Dataset

Updated 15 January 2026
  • BioPIE is a large-scale, procedure-centric dataset that annotates biomedical experimental protocols using detailed knowledge graphs to map entities, parameters, and interrelations.
  • It addresses high information density and multi-step reasoning by precisely extracting dense operational parameters like volume, temperature, and timing.
  • The dataset powers advanced QA systems and lab automation by linking textual protocols with structured representations for real-world biomedical applications.

The Biomedical Protocol Information Extraction Dataset (BioPIE) is a large-scale, procedure-centric resource designed for fine-grained extraction and reasoning over biomedical experimental protocols. BioPIE provides structured knowledge graphs (KGs) detailing entities, parameters, and interrelations within experimental workflows, thereby enabling advanced question answering (QA) and information extraction (IE) and powering downstream applications such as laboratory automation. The dataset addresses two principal challenges in biomedical QA: High Information Density (HID) and Multi-Step Reasoning (MSR), facilitating high-fidelity comprehension and automation of experimental procedures (Hou et al., 8 Jan 2026).

1. Motivation and Problem Definition

BioPIE was developed to address the deficits in existing biomedical IE datasets, which are either overly coarse-grained or domain-narrow, failing to support the nuanced, step-specific reasoning required for protocol-centric applications. HID refers to the dense packaging of operational parameters—such as volume, temperature, and timing—within individual sentences. MSR denotes the frequent necessity to integrate information across multiple steps to resolve protocol-related queries. The specific goals of BioPIE are as follows:

  • Provide a cross-disciplinary dataset of annotated biomedical protocols, supporting diverse experimental domains.
  • Enable automated construction of step-level KGs to capture HID and support MSR in procedural reasoning.
  • Serve as a foundation for both QA systems and downstream automation in laboratory settings.

2. Annotation Schema and Graph Representation

BioPIE adopts a detailed, procedure-centric annotation schema, yielding directed graphs for each protocol step. The annotation scope is intentionally broad, allowing coverage across cell culture, sequencing, microscopy, biomaterials, and related areas.

2.1 Entity Types (34 total)

  • Actions & Processes: verb, process, method
  • Materials & Consumables: chemical, protein, polymer, biomaterial, consumable, blend
  • Containers & Instruments: container, device, software
  • Parameters: volume, temperature, time, force, concentration, speed, energy, mass, size, length, repetitions
  • Biological Objects: cell, nucleic acid, organ, animal, plant
  • Other: part, state, environment, data, number, position

2.2 Relation Types (21 total)

  • Action–Object: is_object_of, contain
  • Action–Parameter: have_parameter, have_property, repetitions
  • Resource Usage: use_device, use_reagent, apply_material, use_method, use_software
  • Procedural Logic: next_step, for_each, to, from, during, based_on, or, not, equal, in_condition_of, is_goal_of

Each annotated protocol step produces a graph representation:

g = (V, E, \{T_n\}, \{T_e\})

where $V$ is the set of entity mentions, $E$ the set of typed relations, and $T_n$, $T_e$ the textual attributes attached to nodes and edges.
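The step-graph structure can be sketched as a small container type. The class, identifiers, and example sentence below are illustrative only, not the dataset's actual storage format:

```python
from dataclasses import dataclass, field

@dataclass
class StepGraph:
    """One protocol step as g = (V, E, {T_n}, {T_e}); names are illustrative."""
    nodes: dict = field(default_factory=dict)   # node id -> (entity type, text attribute T_n)
    edges: list = field(default_factory=list)   # (head id, relation type, tail id, edge text T_e)

    def add_node(self, nid, etype, text):
        self.nodes[nid] = (etype, text)

    def add_edge(self, head, rel, tail, text=""):
        self.edges.append((head, rel, tail, text))

# Hypothetical step: "Incubate the cells at 37 °C for 30 min."
g = StepGraph()
g.add_node("v1", "verb", "Incubate")
g.add_node("v2", "cell", "cells")
g.add_node("v3", "temperature", "37 °C")
g.add_node("v4", "time", "30 min")
g.add_edge("v2", "is_object_of", "v1")
g.add_edge("v1", "have_parameter", "v3")
g.add_edge("v1", "have_parameter", "v4")
```

The entity and relation labels reuse the schema types listed in Sections 2.1 and 2.2; edge direction (e.g., parameter attached to the verb) is one plausible convention.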

3. Construction, Normalization, and Annotation

3.1 Protocol Collection and Normalization

BioPIE comprises 509 sub-protocols aggregated from open-access sources (Nature Protocols, Cell STAR Protocols, Bio-Protocol, Wiley Current Protocols, and JoVE), all under Creative Commons licensing. Paragraphs are normalized into stepwise imperative sentences using the Qwen-max model, followed by human review.

3.2 Annotation Workflow

Two expert annotators (backgrounds in computer science and biomedical research) independently annotate all protocols using a longest-span strategy. Nested spans are supported for precise relation mapping. Annotations are resolved via consensus, and quality control is enforced:

  • Annotation tool: BRAT
  • Inter-annotator agreement (Cohen's $\kappa$): $\kappa_\text{entities} = 0.7920$, $\kappa_\text{relations} = 0.6826$
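Cohen's κ compares observed agreement against the chance agreement implied by each annotator's marginal label distribution. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (illustrative)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of items where the annotators match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from each annotator's marginal label distribution
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / n**2
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1; agreement no better than chance yields κ = 0, which is why the reported entity and relation values in the high 0.6–0.8 range indicate substantial agreement.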

3.3 Data Split and Metrics

  • In-domain (ID) protocols: 464
  • Out-of-domain (OOD) held-out protocols: 45
  • Standard metrics are used: precision, recall, and $F_1$ score.
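These metrics reduce to set overlap between gold and predicted tuples (entity spans with types, or relation triples). A minimal sketch:

```python
def prf1(gold, pred):
    """Micro precision/recall/F1 over sets of (span, type) or relation tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```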

4. Dataset Statistics and Resource Comparison

BioPIE offers extensive coverage and granularity relative to existing IE resources. Key statistics are:

| Dataset  | Entity Types | Relation Types | Entities | Relations | Sentences | Rel./Sent. |
|----------|--------------|----------------|----------|-----------|-----------|------------|
| ACE2005  | 7            | 6              | 38,287   | 7,070     | 10,372    | 0.68       |
| SciERC   | 6            | 7              | 8,089    | 4,716     | 2,679     | 1.76       |
| ChemPort | 1            | 13             | 17,340   | 10,065    | 7,552     | 1.33       |
| BioPIE   | 34           | 21             | 10,982   | 8,848     | 1,916     | 4.62       |

BioPIE’s average of 4.62 relations per sentence underscores its high information density, with a much richer schema for entities and relations.
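The density figure follows directly from the counts in the table:

```python
# Relations per sentence in BioPIE: 8,848 relations over 1,916 sentences.
density = 8848 / 1916
print(round(density, 2))  # → 4.62
```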

5. Information Extraction Methodologies and Benchmarks

5.1 Task Definitions

  • NER: Identify entity spans $e_j = w_l \ldots w_r$ and assign each a type $t_j \in E$.
  • RE: For each ordered entity pair $(e_j, e_k)$, predict a relation $r_{jk} \in R \cup \{\mathrm{NULL}\}$.
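The RE formulation can be sketched as exhaustive pair classification over extracted entities. Here `classify` is a stand-in for any trained pair classifier; the toy rule below is purely illustrative and not a model from the paper:

```python
from itertools import permutations

def extract_relations(entities, classify):
    """Enumerate ordered entity pairs and keep non-NULL predictions (sketch)."""
    triples = []
    for e_j, e_k in permutations(entities, 2):
        r = classify(e_j, e_k)          # r in R ∪ {"NULL"}
        if r != "NULL":
            triples.append((e_j, r, e_k))
    return triples

# Toy classifier: a verb takes temperature/time entities via have_parameter.
def toy_classify(e_j, e_k):
    if e_j[1] == "verb" and e_k[1] in {"temperature", "time"}:
        return "have_parameter"
    return "NULL"

ents = [("Incubate", "verb"), ("37 °C", "temperature"), ("30 min", "time")]
print(extract_relations(ents, toy_classify))
```

Since candidate pairs grow quadratically in the number of entities per sentence, BioPIE's high entity density makes the NULL class dominant, which is part of why RE lags NER in the benchmarks below.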

5.2 Baselines and Evaluation

Supervised and LLM-based approaches were benchmarked:

  • Supervised: PL-Marker (span classification pipeline), HGERE (joint hypergraph neural extraction).
  • LLMs: GPT-5, Claude-4.5, Llama-4, Qwen-max (zero- or few-shot); LoRA fine-tuned Llama-3-8B and Qwen-3-7B.
  • Training: Supervised models use a SciBERT encoder; LLMs receive $k \in [1, 20]$ in-context examples selected via sentence embeddings.
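Embedding-based example selection can be sketched as a nearest-neighbor lookup; the cosine-similarity ranking below assumes precomputed sentence embeddings from some encoder (the encoder itself is not specified here):

```python
import math

def top_k_examples(query_vec, example_vecs, k):
    """Return indices of the k training sentences whose embeddings are most
    cosine-similar to the query embedding (sketch)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    scored = sorted(enumerate(example_vecs),
                    key=lambda iv: cos(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```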

Key IE Results (In-domain test):

| Model              | NER $F_1$ | Rel $F_1$ | Rel$^+$ $F_1$ | RE (gold NER) |
|--------------------|-----------|-----------|---------------|---------------|
| PL-Marker (sup.)   | 87.40     | 82.55     | 74.52         | 87.88         |
| HGERE (sup.)       | 87.63     | 82.10     | 73.93         | —             |
| Claude-4.5 (few)   | 85.18     | 67.87     | 63.47         | 77.20         |
| Qwen-3-7B (LoRA)   | 82.92     | 70.70     | 62.96         | —             |
| Llama-3-8B (LoRA)  | 86.33     | 75.28     | 68.13         | 81.67         |

While supervised models outperform LLMs on relation extraction (RE), especially in-domain, LoRA fine-tuning approaches supervised-level results and exhibits better out-of-domain robustness. Few-shot LLMs narrow the gap when given 5–15 in-context examples.

6. QA System Architecture and Empirical Results

6.1 Pipeline Overview

The QA system leverages extracted (sentence, graph) pairs and a hybrid retrieval–generation pipeline:

  1. Extract all (sentence, graph) pairs from 4,813 protocols.
  2. Rank each pair $(s_i, g_i)$ against a query $q$ by:

R(q, s_i, g_i) = R_t(q, s_i) \cdot \log(1 + R_g(q, g_i))

where $R_t$ is BM25 text relevance and $R_g$ counts graph entities that appear in the query.

  3. Serialize the top-$K$ pairs for LM context input: $[q; s_{i_1}; T_{n_{i_1}}, T_{e_{i_1}}; \ldots]$.
  4. Generate with an LLM $\theta$:

p_\theta(Y \mid \hat s, \hat g) = \prod_t p_\theta(y_t \mid y_{<t}, [q, \hat s, \hat g])
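The ranking step can be sketched as follows; `bm25` is a stand-in for any BM25 scorer, and the toy data is illustrative. Note that under the log term a pair with zero graph-entity overlap scores 0 regardless of its text relevance:

```python
import math

def hybrid_score(text_score, graph_entity_hits):
    """R(q, s_i, g_i) = R_t(q, s_i) * log(1 + R_g(q, g_i))."""
    return text_score * math.log(1 + graph_entity_hits)

def rank_pairs(query_tokens, pairs, bm25):
    """pairs: list of (sentence, graph_entities); bm25(query, sentence) -> float."""
    qset = {t.lower() for t in query_tokens}
    scored = []
    for sent, ents in pairs:
        r_t = bm25(query_tokens, sent)                      # text relevance
        r_g = sum(1 for e in ents if e.lower() in qset)     # graph entities in query
        scored.append((hybrid_score(r_t, r_g), sent))
    return sorted(scored, reverse=True)
```

This multiplicative combination is what makes the retrieval "hybrid": a candidate must be supported by both the surface text and the protocol graph to rank highly.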

6.2 Dataset Structure

  • 4,813 sub-protocols and QA pairs
  • Train/dev/test split: 2,900/250/1,663
  • HID set: 230 questions (avg. 10.4 relations/sentence)
  • MSR set: 179 questions ($\geq 2$ reasoning steps)

6.3 QA Baselines and Ablations

  • LLM only (no retrieval)
  • Text RAG (BM25, LaBSE, embedding models)
  • Graph RAG (GRAG)
  • Ablations: without sentence, without graph, using SciERC or ChemPort graphs

6.4 Accuracy Results (Open-source LLM)

| System            | Test  | HID   | MSR   |
|-------------------|-------|-------|-------|
| LLM only          | 19.12 | 18.30 | 17.88 |
| BM25 + LLM        | 62.24 | 59.57 | 49.72 |
| GRAG + LLM        | 23.63 | 10.64 | 6.70  |
| Ours w/o Graph    | 65.42 | 62.98 | 54.19 |
| Ours w/o Sentence | 64.88 | 67.23 | 56.42 |
| Ours w/ SciERC    | 62.54 | 60.85 | 55.31 |
| Ours w/ ChemPort  | 63.92 | 65.11 | 55.87 |
| Ours (full)       | 70.66 | 69.36 | 62.01 |

BioPIE’s full system outperforms all baselines (+8.4 points overall vs. BM25). Removing graph information or replacing it with domain-oblivious graphs significantly reduces performance, indicating the necessity and complementarity of text and protocol-specific KGs. Closed-source LLMs such as Claude-4.5-Haiku yield up to 89.6% accuracy and show analogous trends.

7. Significance, Applications, and Outlook

BioPIE’s procedure-centric KGs directly model HID parameters (e.g., volume, temperature, force, time) and sequence order via next_step relations, enabling the explicit encoding essential for HID and MSR QA. QA performance on HID (69.36%) and MSR (62.01%) tasks demonstrates BioPIE’s practical advantage over prior datasets and generic approaches. Diminishing returns are observed for relation extraction (saturating near 75 $F_1$), while entity extraction continues to benefit from scaling.

Beyond QA, BioPIE KGs facilitate:

  • Protocol synthesis and optimization
  • Automated workflow validation (e.g., parameter consistency checking)
  • Programmatic transformation of protocols into robotic scripts
  • Modular and conditional orchestration in AI-powered or autonomous laboratories
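The robotic-script use case relies on ordering steps along next_step edges. A minimal topological-sort sketch (real protocols may add branching via for_each, or, and in_condition_of relations, which this ignores):

```python
from collections import defaultdict, deque

def order_steps(steps, next_step_edges):
    """Topologically order protocol steps along next_step relations so they
    can be emitted as a sequential script (Kahn's algorithm)."""
    indeg = {s: 0 for s in steps}
    succ = defaultdict(list)
    for a, b in next_step_edges:          # a --next_step--> b
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(s for s in steps if indeg[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in succ[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    return order
```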

BioPIE establishes a foundation for both improved near-term QA over biomedical protocols and longer-term AI-driven laboratory automation by furnishing high-resolution, procedure-focused experimental knowledge graphs (Hou et al., 8 Jan 2026).
