
BioPIE: Biomedical Protocol Extraction Dataset

Updated 15 January 2026
  • BioPIE is a large-scale, procedure-centric dataset that annotates biomedical experimental protocols using detailed knowledge graphs to map entities, parameters, and interrelations.
  • It addresses high information density and multi-step reasoning by precisely extracting dense operational parameters like volume, temperature, and timing.
  • The dataset powers advanced QA systems and lab automation by linking textual protocols with structured representations for real-world biomedical applications.

The Biomedical Protocol Information Extraction Dataset (BioPIE) is a large-scale, procedure-centric resource designed for fine-grained extraction and reasoning over biomedical experimental protocols. BioPIE provides structured knowledge graphs (KGs) detailing entities, parameters, and interrelations within experimental workflows, thereby enabling advanced question answering (QA) and information extraction (IE) and powering downstream applications such as laboratory automation. The dataset addresses two principal challenges in biomedical QA: High Information Density (HID) and Multi-Step Reasoning (MSR), facilitating high-fidelity comprehension and automation of experimental procedures (Hou et al., 8 Jan 2026).

1. Motivation and Problem Definition

BioPIE was developed to address the deficits in existing biomedical IE datasets, which are either overly coarse-grained or domain-narrow, failing to support the nuanced, step-specific reasoning required for protocol-centric applications. HID refers to the dense packaging of operational parameters—such as volume, temperature, and timing—within individual sentences. MSR denotes the frequent necessity to integrate information across multiple steps to resolve protocol-related queries. The specific goals of BioPIE are as follows:

  • Provide a cross-disciplinary dataset of annotated biomedical protocols, supporting diverse experimental domains.
  • Enable automated construction of step-level KGs to capture HID and support MSR in procedural reasoning.
  • Serve as a foundation for both QA systems and downstream automation in laboratory settings.

2. Annotation Schema and Graph Representation

BioPIE adopts a detailed, procedure-centric annotation schema, yielding directed graphs for each protocol step. The annotation scope is intentionally broad, allowing coverage across cell culture, sequencing, microscopy, biomaterials, and related areas.

2.1 Entity Types (34 total)

  • Actions & Processes: verb, process, method
  • Materials & Consumables: chemical, protein, polymer, biomaterial, consumable, blend
  • Containers & Instruments: container, device, software
  • Parameters: volume, temperature, time, force, concentration, speed, energy, mass, size, length, repetitions
  • Biological Objects: cell, nucleic acid, organ, animal, plant
  • Other: part, state, environment, data, number, position

2.2 Relation Types (21 total)

  • Action–Object: is_object_of, contain
  • Action–Parameter: have_parameter, have_property, repetitions
  • Resource Usage: use_device, use_reagent, apply_material, use_method, use_software
  • Procedural Logic: next_step, for_each, to, from, during, based_on, or, not, equal, in_condition_of, is_goal_of

Each annotated protocol step produces a graph representation:

g = (V, E, \{T_n\}, \{T_e\})

where $V$ is the set of entity mentions, $E$ the set of typed relations, and $T_n$, $T_e$ the textual attributes attached to nodes and edges.
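The step-graph structure can be sketched as a small container type. The class, identifiers, and example sentence below are illustrative only, not the dataset's actual storage format:

```python
from dataclasses import dataclass, field

@dataclass
class StepGraph:
    """One protocol step as g = (V, E, {T_n}, {T_e}); names are illustrative."""
    nodes: dict = field(default_factory=dict)   # node id -> (entity type, text attribute T_n)
    edges: list = field(default_factory=list)   # (head id, relation type, tail id, edge text T_e)

    def add_node(self, nid, etype, text):
        self.nodes[nid] = (etype, text)

    def add_edge(self, head, rel, tail, text=""):
        self.edges.append((head, rel, tail, text))

# Hypothetical step: "Incubate the cells at 37 °C for 30 min."
g = StepGraph()
g.add_node("v1", "verb", "Incubate")
g.add_node("v2", "cell", "cells")
g.add_node("v3", "temperature", "37 °C")
g.add_node("v4", "time", "30 min")
g.add_edge("v2", "is_object_of", "v1")
g.add_edge("v1", "have_parameter", "v3")
g.add_edge("v1", "have_parameter", "v4")
```

The entity and relation labels reuse the schema types listed in Sections 2.1 and 2.2; edge direction (e.g., parameter attached to the verb) is one plausible convention.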

3. Construction, Normalization, and Annotation

3.1 Protocol Collection and Normalization

BioPIE comprises 509 sub-protocols aggregated from open-access sources (Nature Protocols, Cell STAR Protocols, Bio-Protocol, Wiley Current Protocols, and JoVE), all under Creative Commons licensing. Paragraphs are normalized into stepwise imperative sentences using the Qwen-max model, followed by human review.

3.2 Annotation Workflow

Two expert annotators (backgrounds in computer science and biomedical research) independently annotate all protocols using a longest-span strategy. Nested spans are supported for precise relation mapping. Annotations are resolved via consensus, and quality control is enforced:

  • Annotation tool: BRAT
  • Inter-annotator agreement (Cohen's $\kappa$): $\kappa_\text{entities} = 0.7920$, $\kappa_\text{relations} = 0.6826$
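Cohen's κ compares observed agreement against the chance agreement implied by each annotator's marginal label distribution. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (illustrative)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of items where the annotators match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from each annotator's marginal label distribution
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / n**2
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1; agreement no better than chance yields κ = 0, which is why the reported entity and relation values in the high 0.6–0.8 range indicate substantial agreement.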

3.3 Data Split and Metrics

  • In-domain (ID) protocols: 464
  • Out-of-domain (OOD) held-out protocols: 45
  • Standard metrics are used: precision, recall, and $F_1$ score.
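These metrics reduce to set overlap between gold and predicted tuples (entity spans with types, or relation triples). A minimal sketch:

```python
def prf1(gold, pred):
    """Micro precision/recall/F1 over sets of (span, type) or relation tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```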

4. Dataset Statistics and Resource Comparison

BioPIE offers extensive coverage and granularity relative to existing IE resources. Key statistics are:

| Dataset  | Entity Types | Relation Types | Entities | Relations | Sentences | Rel./Sent. |
|----------|--------------|----------------|----------|-----------|-----------|------------|
| ACE2005  | 7            | 6              | 38,287   | 7,070     | 10,372    | 0.68       |
| SciERC   | 6            | 7              | 8,089    | 4,716     | 2,679     | 1.76       |
| ChemPort | 1            | 13             | 17,340   | 10,065    | 7,552     | 1.33       |
| BioPIE   | 34           | 21             | 10,982   | 8,848     | 1,916     | 4.62       |

BioPIE’s average of 4.62 relations per sentence underscores its high information density, with a much richer schema for entities and relations.
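The density figure follows directly from the counts in the table:

```python
# Relations per sentence in BioPIE: 8,848 relations over 1,916 sentences.
density = 8848 / 1916
print(round(density, 2))  # → 4.62
```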

5. Information Extraction Methodologies and Benchmarks

5.1 Task Definitions

  • NER: Identify entity spans $e_j = w_l \ldots w_r$ and assign each a type $t_j \in E$.
  • RE: For each ordered entity pair $(e_j, e_k)$, predict a relation $r_{jk} \in R \cup \{\mathrm{NULL}\}$.
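The RE formulation can be sketched as exhaustive pair classification over extracted entities. Here `classify` is a stand-in for any trained pair classifier; the toy rule below is purely illustrative and not a model from the paper:

```python
from itertools import permutations

def extract_relations(entities, classify):
    """Enumerate ordered entity pairs and keep non-NULL predictions (sketch)."""
    triples = []
    for e_j, e_k in permutations(entities, 2):
        r = classify(e_j, e_k)          # r in R ∪ {"NULL"}
        if r != "NULL":
            triples.append((e_j, r, e_k))
    return triples

# Toy classifier: a verb takes temperature/time entities via have_parameter.
def toy_classify(e_j, e_k):
    if e_j[1] == "verb" and e_k[1] in {"temperature", "time"}:
        return "have_parameter"
    return "NULL"

ents = [("Incubate", "verb"), ("37 °C", "temperature"), ("30 min", "time")]
print(extract_relations(ents, toy_classify))
```

Since candidate pairs grow quadratically in the number of entities per sentence, BioPIE's high entity density makes the NULL class dominant, which is part of why RE lags NER in the benchmarks below.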

5.2 Baselines and Evaluation

Supervised and LLM-based approaches were benchmarked:

  • Supervised: PL-Marker (span classification pipeline), HGERE (joint hypergraph neural extraction).
  • LLMs: GPT-5, Claude-4.5, Llama-4, Qwen-max (zero- or few-shot); LoRA fine-tuned Llama-3-8B and Qwen-3-7B.
  • Training: Supervised models use a SciBERT encoder; LLMs receive $k \in [1, 20]$ in-context examples selected via sentence embeddings.
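Embedding-based example selection can be sketched as a nearest-neighbor lookup; the cosine-similarity ranking below assumes precomputed sentence embeddings from some encoder (the encoder itself is not specified here):

```python
import math

def top_k_examples(query_vec, example_vecs, k):
    """Return indices of the k training sentences whose embeddings are most
    cosine-similar to the query embedding (sketch)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    scored = sorted(enumerate(example_vecs),
                    key=lambda iv: cos(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```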

Key IE Results (In-domain test):

| Model              | NER $F_1$ | Rel $F_1$ | Rel$^+$ $F_1$ | RE (gold NER) |
|--------------------|-----------|-----------|---------------|---------------|
| PL-Marker (sup.)   | 87.40     | 82.55     | 74.52         | 87.88         |
| HGERE (sup.)       | 87.63     | 82.10     | 73.93         | —             |
| Claude-4.5 (few)   | 85.18     | 67.87     | 63.47         | 77.20         |
| Qwen-3-7B (LoRA)   | 82.92     | 70.70     | 62.96         | —             |
| Llama-3-8B (LoRA)  | 86.33     | 75.28     | 68.13         | 81.67         |

While supervised models outperform LLMs on relation extraction (RE), especially in-domain, LoRA fine-tuning approaches supervised-level results and exhibits better out-of-domain robustness. Few-shot LLMs narrow the gap when given 5–15 in-context examples.

6. QA System Architecture and Empirical Results

6.1 Pipeline Overview

The QA system leverages extracted (sentence, graph) pairs and a hybrid retrieval–generation pipeline:

  1. Extract all (sentence, graph) pairs from 4,813 protocols.
  2. Rank each pair $(s_i, g_i)$ against a query $q$ by:

R(q, s_i, g_i) = R_t(q, s_i) \cdot \log(1 + R_g(q, g_i))

where $R_t$ is BM25 text relevance and $R_g$ counts graph entities that appear in the query.

  3. Serialize the top-$K$ pairs for LM context input: $[q; s_{i_1}; T_{n_{i_1}}, T_{e_{i_1}}; \ldots]$.
  4. Generate with an LLM $\theta$:

p_\theta(Y \mid \hat s, \hat g) = \prod_t p_\theta(y_t \mid y_{<t}, [q, \hat s, \hat g])
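The ranking step can be sketched as follows; `bm25` is a stand-in for any BM25 scorer, and the toy data is illustrative. Note that under the log term a pair with zero graph-entity overlap scores 0 regardless of its text relevance:

```python
import math

def hybrid_score(text_score, graph_entity_hits):
    """R(q, s_i, g_i) = R_t(q, s_i) * log(1 + R_g(q, g_i))."""
    return text_score * math.log(1 + graph_entity_hits)

def rank_pairs(query_tokens, pairs, bm25):
    """pairs: list of (sentence, graph_entities); bm25(query, sentence) -> float."""
    qset = {t.lower() for t in query_tokens}
    scored = []
    for sent, ents in pairs:
        r_t = bm25(query_tokens, sent)                      # text relevance
        r_g = sum(1 for e in ents if e.lower() in qset)     # graph entities in query
        scored.append((hybrid_score(r_t, r_g), sent))
    return sorted(scored, reverse=True)
```

This multiplicative combination is what makes the retrieval "hybrid": a candidate must be supported by both the surface text and the protocol graph to rank highly.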

6.2 Dataset Structure

  • 4,813 sub-protocols and QA pairs
  • Train/dev/test split: 2,900/250/1,663
  • HID set: 230 questions (avg. 10.4 relations/sentence)
  • MSR set: 179 questions ($\geq 2$ reasoning steps)

6.3 QA Baselines and Ablations

  • LLM only (no retrieval)
  • Text RAG (BM25, LaBSE, embedding models)
  • Graph RAG (GRAG)
  • Ablations: without sentence, without graph, using SciERC or ChemPort graphs

6.4 Accuracy Results (Open-source LLM)

| System            | Test  | HID   | MSR   |
|-------------------|-------|-------|-------|
| LLM only          | 19.12 | 18.30 | 17.88 |
| BM25 + LLM        | 62.24 | 59.57 | 49.72 |
| GRAG + LLM        | 23.63 | 10.64 | 6.70  |
| Ours w/o Graph    | 65.42 | 62.98 | 54.19 |
| Ours w/o Sentence | 64.88 | 67.23 | 56.42 |
| Ours w/ SciERC    | 62.54 | 60.85 | 55.31 |
| Ours w/ ChemPort  | 63.92 | 65.11 | 55.87 |
| Ours (full)       | 70.66 | 69.36 | 62.01 |

BioPIE’s full system outperforms all baselines (+8.4 points overall vs. BM25). Removing graph information or replacing it with domain-oblivious graphs significantly reduces performance, indicating the necessity and complementarity of text and protocol-specific KGs. Closed-source LLMs such as Claude-4.5-Haiku yield up to 89.6% accuracy and show analogous trends.

7. Significance, Applications, and Outlook

BioPIE’s procedure-centric KGs directly model HID parameters (e.g., volume, temperature, force, time) and sequence order via next_step relations, enabling the explicit encoding essential for HID and MSR QA. QA performance on HID (69.36%) and MSR (62.01%) tasks demonstrates BioPIE’s practical advantage over prior datasets and generic approaches. Diminishing returns are observed for relation extraction (saturating near 75 $F_1$), while entity extraction continues to benefit from scaling.

Beyond QA, BioPIE KGs facilitate:

  • Protocol synthesis and optimization
  • Automated workflow validation (e.g., parameter consistency checking)
  • Programmatic transformation of protocols into robotic scripts
  • Modular and conditional orchestration in AI-powered or autonomous laboratories
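The robotic-script use case relies on ordering steps along next_step edges. A minimal topological-sort sketch (real protocols may add branching via for_each, or, and in_condition_of relations, which this ignores):

```python
from collections import defaultdict, deque

def order_steps(steps, next_step_edges):
    """Topologically order protocol steps along next_step relations so they
    can be emitted as a sequential script (Kahn's algorithm)."""
    indeg = {s: 0 for s in steps}
    succ = defaultdict(list)
    for a, b in next_step_edges:          # a --next_step--> b
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(s for s in steps if indeg[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in succ[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    return order
```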

BioPIE establishes a foundation for both improved near-term QA over biomedical protocols and longer-term AI-driven laboratory automation by furnishing high-resolution, procedure-focused experimental knowledge graphs (Hou et al., 8 Jan 2026).
