ZSEE Dataset: Zeolite Synthesis Event Extraction

Updated 24 December 2025
  • ZSEE Dataset is a curated benchmark that annotates zeolite synthesis sentences with detailed event types, triggers, and argument roles.
  • It employs a unified JSON schema to capture specific synthesis steps such as reagent addition, stirring, and calcination with high precision.
  • The dataset benchmarks both specialized models and LLMs, revealing strengths in event classification and limitations in argument extraction.

The ZSEE (Zeolite Synthesis Event Extraction) dataset is a curated, expertly annotated benchmark designed to facilitate scientific information extraction from zeolite synthesis experimental procedures. Built around a comprehensive annotation schema, it emphasizes event-centric extraction tasks and is widely used to evaluate both specialized event-extraction models and LLMs under a range of prompting strategies. ZSEE comprises 1,530 sentences selected from peer-reviewed journal articles, labeled for event types, trigger spans, argument roles, and argument text spans, and encoded in a unified JSON schema (Rathore et al., 17 Dec 2025).

1. Construction and Curation Protocol

ZSEE's corpus consists of 1,530 sentences sourced from the experimental procedures in peer-reviewed publications on zeolite synthesis, with initial sentence collection attributed to He et al. (2024). Sentence selection aimed for comprehensive coverage of synthesis operations, including but not limited to reagent addition, stirring, heating, washing, and related steps. Preprocessing was restricted to standard tokenization and basic regularization (e.g., Unicode normalization, bracket correction) to ensure valid JSON outputs for downstream LLM evaluations. No domain-specific filtering or unconventional sentence splitting was performed. The dataset is solely composed of sentences, not full procedures.
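A minimal sketch of the kind of light regularization described above (Unicode normalization plus quote/bracket correction); the substitution table and function names are illustrative assumptions, not the authors' released preprocessing code:

```python
import unicodedata

# Characters that commonly corrupt downstream JSON or span matching.
# The exact substitution table used for ZSEE is not published; this
# mapping is an illustrative assumption.
CHAR_FIXES = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\uff3b": "[", "\uff3d": "]",   # full-width brackets -> ASCII
})

def regularize(sentence: str) -> str:
    """Apply Unicode normalization and basic quote/bracket correction."""
    text = unicodedata.normalize("NFKC", sentence)
    return text.translate(CHAR_FIXES).strip()

print(regularize("The gel was stirred at 60\u2009\u00b0C \u201covernight\u201d."))
# -> The gel was stirred at 60 °C "overnight".
```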

2. Annotation Schema and Event Taxonomy

Information extraction with ZSEE is cast as four formal subtasks, each utilizing a structured annotation protocol:

  • Event type classification: Each sentence is matched against 16 rigorously defined synthesis event classes.
  • Trigger text identification: For each detected event, the exact text span (token-level) that signals the event is determined.
  • Argument role extraction: Each participant or process parameter is mapped to one of 13 argument-role categories.
  • Argument text extraction: For each (event, role) tuple, the annotator extracts the corresponding text span from the sentence.

Event types and their roles, as presented in the zero-shot prompt, include "Add" (material is introduced; roles: material, temperature, container), "Stir" (mechanical agitation; roles: duration, temperature, revolution, sample), "Calcine" (high-temperature treatment; roles: duration, temperature, etc.), among others, with event definitions covering fine-grained procedural variety. Argument roles are operationalized to capture numeric values (temperature, duration), entities (material, container, sample), and qualitative or operational descriptors (condition, revolution_text). Argument-role definitions are provided in the annotation guidelines for annotator consistency. The annotation schema is formalized in strict JSON, supporting programmatic evaluation and model training.
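To make the schema concrete, here is a hypothetical annotation instance in the spirit of the description above; the field names are assumptions for illustration, and the released dataset's keys may differ:

```python
import json

# A hypothetical ZSEE-style annotation instance (field names assumed).
example = {
    "sentence": "The mixture was stirred at 60 °C for 24 h.",
    "events": [
        {
            "event_type": "Stir",        # one of the 16 event classes
            "trigger_text": "stirred",   # token-level span signaling the event
            "arguments": [
                {"role": "temperature", "text": "60 °C"},
                {"role": "duration", "text": "24 h"},
            ],
        }
    ],
}

print(json.dumps(example, ensure_ascii=False, indent=2))
```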

3. Annotation Process and Quality Control

ZSEE annotation was performed by domain experts instructed to extract only information explicitly present in each sentence; inferential annotation was explicitly excluded. Each annotation instance conforms to a JSON structure encoding event types, their triggers, and the set of argument-role text spans per event. Example-driven guidelines, including negative controls (e.g., not labeling an event that is merely mentioned as context rather than actually performed), were established to minimize systematic error. According to He et al. (2024), two annotators reviewed each sentence, with subsequent adjudication; inter-annotator agreement reportedly exceeds 85% F1 on trigger and role extraction. Quality control involved periodic spot checks and mandatory correction of any outputs not conforming to the schema, as sketched below.
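A sketch of the kind of schema-conformance check this quality-control step implies, reusing the illustrative field names from Section 2; the rules and the (truncated) label inventories here are assumptions, not the authors' tooling:

```python
# Subsets of the 16 event classes and 13 argument roles, for illustration.
EVENT_TYPES = {"Add", "Stir", "Calcine", "Wash", "Heat"}
ARG_ROLES = {"material", "temperature", "duration", "container", "sample"}

def schema_violations(instance: dict) -> list[str]:
    """Return schema violations; an empty list means the instance conforms."""
    errors = []
    sentence = instance.get("sentence", "")
    for i, event in enumerate(instance.get("events", [])):
        if event.get("event_type") not in EVENT_TYPES:
            errors.append(f"event {i}: unknown event_type {event.get('event_type')!r}")
        if event.get("trigger_text", "") not in sentence:
            errors.append(f"event {i}: trigger is not a span of the sentence")
        for arg in event.get("arguments", []):
            if arg.get("role") not in ARG_ROLES:
                errors.append(f"event {i}: unknown role {arg.get('role')!r}")
            if arg.get("text", "") not in sentence:
                errors.append(f"event {i}: argument text not found in sentence")
    return errors
```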

4. Quantitative Characterization

The dataset supports detailed quantitative analysis:

  • Sentence and event statistics: 1,530 sentences with a mean of 1.8 events per sentence.
  • Argument density: Average of 2.3 argument mentions per event.
  • Event type distribution: Add, Stir, and Calcine collectively account for approximately 45% of all event mentions (each about 15%). Rarer events, such as Rotate, Seal, and Sonicate, constitute less than 3% each.
  • Argument-role occurrence: "material" comprises roughly 40% of argument annotations, followed by "temperature" (25%) and "duration" (20%). Rare argument roles (such as "revolution_text", "times") appear in fewer than 5% of mentions.
  • Data splits: All 1,530 sentences serve as a held-out evaluation set for the prompting experiments; no train/validation/test partitions are pre-defined. Comparisons to published PAIE and Zero-Reader results likewise use the full dataset, keeping the evaluations aligned. (A sketch for recomputing the statistics above follows this list.)
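Given annotations in the assumed structure sketched in Section 2, the headline statistics above are straightforward to recompute; a minimal sketch:

```python
from collections import Counter

def corpus_stats(annotations: list[dict]) -> dict:
    """Compute events per sentence, arguments per event, and event-type
    shares from ZSEE-style JSON records (field names assumed)."""
    type_counts = Counter()
    n_events = n_args = 0
    for record in annotations:
        for event in record["events"]:
            type_counts[event["event_type"]] += 1
            n_events += 1
            n_args += len(event["arguments"])
    return {
        "events_per_sentence": n_events / len(annotations),
        "arguments_per_event": n_args / n_events,
        "event_type_share": {t: c / n_events for t, c in type_counts.most_common()},
    }
```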

5. Evaluation Tasks and Metrics

ZSEE is used to benchmark performance on all four subtasks jointly, supporting evaluation of both general-purpose and specialized extraction architectures. The principal metrics rely on lemmatization-based subset matching between gold and predicted spans and labels for each subtask; a sketch of this scoring follows the formulas below:

  • Precision: $P = \frac{\text{number of correct predictions}}{\text{number of predicted items}}$
  • Recall: $R = \frac{\text{number of correct predictions}}{\text{number of gold items}}$
  • F1 score: $F_1 = \frac{2PR}{P + R}$
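A minimal sketch of scoring via subset matching over lemmatized spans; the paper's exact lemmatizer and matching details are not reproduced here, so lowercasing stands in as a placeholder:

```python
def normalize(span: str, lemmatize=str.lower) -> tuple[str, ...]:
    """Reduce a text span to a comparable tuple of lemmatized tokens."""
    return tuple(lemmatize(tok) for tok in span.split())

def prf1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 over sets of normalized (label, span) items."""
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy usage: lowercase stands in for a real lemmatizer.
gold = {("Stir", normalize("stirred"))}
pred = {("Stir", normalize("Stirred")), ("Add", normalize("added"))}
print(prf1(gold, pred))  # (0.5, 1.0, 0.666...)
```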

The table below, excerpted from (Rathore et al., 17 Dec 2025), summarizes typical F1 scores (%) for a representative LLM under zero-shot versus few-shot prompting:

Subtask           Zero-Shot F1 (%)   Few-Shot F1 (%)
Event Type        86.5               85.4
Trigger Text      70.4               87.6
Argument Roles    64.3               74.7
Argument Texts    56.5               65.9

Specialized models set stronger baselines: PAIE reaches 92% F1 on combined event-type and trigger identification and 74% F1 on arguments, while Zero-Reader delivers comparable event detection with slight gains on arguments, attributed to contrastive learning.

6. Applications and Key Findings

ZSEE serves as a standard evaluation resource for scientific event extraction, with primary emphasis on the zeolite synthesis literature. It underpins systematic studies of LLM prompting strategies, including zero-shot, few-shot, event-specific, and reflection-based prompting. Results indicate that general-purpose LLMs approach specialized models in coarse event classification (82–89% F1) but consistently underperform in argument role and text extraction, falling short by roughly 10–15 F1 points. GPT-5-mini, in particular, displays marked prompt sensitivity, with trigger-extraction F1 ranging from 11% to 79% depending on prompt design. Advanced prompting yields minimal gains over zero-shot baselines, highlighting current LLM architectural limitations for domain-specific, fine-grained scientific extraction. Error analyses document systematic hallucinations, over-generalizations, and difficulties in capturing synthesis nuances.
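As an illustration of the prompting setup (not the paper's actual prompts), a zero-shot extraction prompt for this task might be assembled along these lines, with few-shot variants appending worked examples:

```python
def zero_shot_prompt(sentence: str) -> str:
    """Assemble a zero-shot event-extraction prompt requesting strict JSON."""
    schema_hint = (
        'Respond with JSON only: {"events": [{"event_type": ..., '
        '"trigger_text": ..., "arguments": [{"role": ..., "text": ...}]}]}'
    )
    return (
        "Extract zeolite synthesis events from the sentence below.\n"
        "Label only information explicitly stated; do not infer "
        "unstated steps or values.\n"
        f"{schema_hint}\n"
        f"Sentence: {sentence}"
    )

print(zero_shot_prompt("The gel was calcined at 550 °C for 6 h."))
```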

7. Significance and Ongoing Development

ZSEE fills a critical gap in scientific information extraction benchmarks by focusing on the nuanced procedural text inherent to synthetic chemistry, specifically zeolite synthesis. Its rigorous expert annotation, multi-level schema, and commitment to explicit extraction pave the way for transparent evaluation and ablation studies in the scientific NLP domain. The absence of standard data splits and the complex schema reflect both the dataset's strengths (comprehensive coverage, strict quality) and open challenges (generalizability, model adaptation). ZSEE has been instrumental in establishing quantitative benchmarks that clarify the capabilities and limitations of contemporary LLMs and specialized architectures for event-centric extraction in the scientific literature (Rathore et al., 17 Dec 2025).
