
SciREX: Document-Level IE Benchmark

Updated 3 November 2025
  • SciREX is a document-level IE benchmark designed to extract N-ary relations and salient entities from full-length scientific papers by aggregating dispersed information.
  • It employs a hybrid annotation approach that combines distant supervision, automated BERT+CRF mention detection, and expert corrections to ensure high-quality labeling.
  • Baseline models reveal significant challenges in full-document extraction, with notably low F1 scores for complex 4-ary relations underscoring the need for advanced techniques.

The SciREX dataset is a large-scale resource for document-level information extraction (IE) in scientific literature, specifically targeting the extraction of entities and their salient N-ary (not just binary) relationships from entire scholarly articles. Unlike most prior datasets, which focus on sentence- or paragraph-level relations, SciREX emphasizes extraction at the document level, requiring comprehensive modeling of long, structurally complex scientific documents in which the relevant information is often dispersed across sections.

1. Motivation and Design Principles

The motivation behind SciREX is rooted in the observation that critical scientific facts and relationships—particularly the main contributions and results of scholarly articles—frequently span multiple sentences and sections, making them inaccessible to models and datasets designed for local (sentence/paragraph) contexts. Existing resources such as SciERC are limited to abstracts, ignoring the intricacy of whole-document reasoning. SciREX addresses the need for a benchmark that supports:

  • Annotation of entities and their N-ary relations across entire documents
  • Evaluation of models’ capabilities in aggregating distributed scientific evidence
  • Development of document-level IE systems, moving beyond sentence-based extraction

This approach is positioned to challenge and advance the state of the art in scientific information extraction and accurate knowledge base construction from unstructured scientific texts (Jain et al., 2020).

2. Construction and Annotation Methodology

SciREX comprises 438 full scientific papers drawn primarily from machine learning, with extensive manual and automatic annotation. The labeling process is divided into several steps:

  1. Distant Supervision: Utilize Papers with Code (PwC) as a knowledge base to identify result tuples present in the paper (e.g., (Dataset, Metric, Method, Task, Score)). These tuples provide weak labels but not textual spans.
  2. Text Preprocessing: Process arXiv PDFs to segment into sections and sentences using LaTeXML and Grobid.
  3. Automatic Mention Detection: A BERT+CRF span tagger, trained on SciERC, generates candidate mention spans and entity types.
  4. Entity Linking: Match candidate mentions to PwC entities using Jaccard similarity with a high-recall threshold.
  5. Human Annotation: Experts correct spans, types, add missing salient mentions, delete noisy predictions, and resolve entity links. Detailed annotation guidelines ensure high inter-annotator agreement (Cohen’s κ ≈ 0.95).

This hybrid workflow achieves both annotation quality and scalability: 83% of automatic mention labels are correct, with humans adding ~15% new mentions (mainly salient entities) (Jain et al., 2020).
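The entity-linking step (step 4) can be sketched as a token-level Jaccard match between candidate mention strings and PwC entity names. This is a minimal illustration; the threshold value and tokenization below are assumptions, not the exact settings used by the authors.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def link_mentions(mentions, pwc_entities, threshold=0.5):
    """Link each candidate mention to its best-matching PwC entity,
    keeping only matches above a (high-recall) similarity threshold."""
    links = {}
    for mention in mentions:
        best = max(pwc_entities, key=lambda e: jaccard(mention, e))
        if jaccard(mention, best) >= threshold:
            links[mention] = best
    return links
```

For example, the mention "CIFAR-10 dataset" links to the PwC entity "CIFAR-10" (Jaccard 0.5), while an unrelated phrase is left unlinked and deferred to the human-correction step.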

3. Supported Tasks and Dataset Structure

SciREX enables benchmarking of the following document-level IE tasks:

  • Entity Recognition: Identification and typing of entity mentions as Method, Dataset, Task, or Metric.
  • Salient Entity Identification: Distinguishing entities central to the paper’s key results (not every mention).
  • Coreference Resolution: Grouping of mentions (including paraphrases and abbreviations) referring to the same entity.
  • N-ary Relation Extraction: Predicting relations (tuples), especially the 4-ary tuple: (Dataset, Metric, Task, Method), typically spanning multiple sentences and/or sections.

The dataset structure is summarized as follows:

Statistic                   SciREX
--------------------------  ------
Documents                   438
Avg. words/doc              5,737
Avg. sections/doc           22
Avg. entity mentions/doc    360
Avg. salient entities/doc   8
Avg. 4-ary relations/doc    5

The majority of N-ary relations (99% of 4-ary, 57% of binary) cross sentence boundaries; 55% of 4-ary relations also cross section boundaries (Jain et al., 2020).
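The document structure and the cross-section statistic above can be illustrated with a toy record. The field names here are illustrative, not the exact SciREX JSON schema; the helper checks whether a relation's supporting mentions span more than one section.

```python
# Illustrative (not the exact SciREX schema): a document with sections,
# typed mention spans, coreference clusters, and one 4-ary relation
# over salient entity clusters.
doc = {
    "sections": [(0, 120), (120, 310)],  # (start, end) word offsets
    "mentions": [                        # (start, end, type)
        (15, 17, "Method"),
        (40, 41, "Dataset"),
        (150, 151, "Metric"),
        (200, 201, "Task"),
    ],
    "clusters": {
        "BERT": [(15, 17)],
        "SQuAD": [(40, 41)],
        "F1": [(150, 151)],
        "QA": [(200, 201)],
    },
    # (Dataset, Metric, Task, Method)
    "n_ary_relations": [("SQuAD", "F1", "QA", "BERT")],
}

def crosses_sections(relation, doc):
    """True if the relation's supporting mentions span more than one section."""
    def section_of(start):
        return next(i for i, (s, e) in enumerate(doc["sections"]) if s <= start < e)
    sections = {
        section_of(m[0]) for name in relation for m in doc["clusters"][name]
    }
    return len(sections) > 1
```

In this toy document the single 4-ary relation draws evidence from both sections, mirroring the 55% of gold 4-ary relations that cross section boundaries.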

4. Baseline Models and Evaluation Protocols

SciREX provides baselines for each subtask, using a joint, end-to-end neural IE architecture:

  • Document Encoding: Each section is encoded with SciBERT; section embeddings are passed to a BiLSTM for document-level context propagation.
  • Mention Identification: A BIOUL-CRF tagger identifies and types spans.
  • Saliency Classification: A feed-forward network distinguishes salient from non-salient mentions.
  • Coreference Resolution: A pairwise classifier (using SciBERT [CLS] token) with agglomerative clustering builds entity clusters.
  • Relation Extraction: All possible (Dataset, Metric, Task, Method) cluster tuples are considered; section-level and document-level tuple features are aggregated and fed into a feed-forward classifier.
  • Training: Joint loss for mention, saliency, and relation subtasks; coreference is trained separately.
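The relation-extraction step enumerates every combination of salient clusters by type before scoring; that candidate generation can be sketched as a Cartesian product (a simplification of the baseline, which additionally aggregates section- and document-level features per tuple):

```python
from itertools import product

def candidate_relations(clusters_by_type):
    """Enumerate all (Dataset, Metric, Task, Method) cluster tuples;
    the baseline's relation classifier scores every such combination."""
    return list(product(*(clusters_by_type[t]
                          for t in ("Dataset", "Metric", "Task", "Method"))))
```

With 2 datasets, 2 metrics, 1 task, and 3 method clusters, 12 candidate tuples are scored; this combinatorial growth is one reason saliency classification matters upstream.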

Evaluation metrics:

  • Macro-F1 for mention/entity recognition (exact match per type)
  • Binary F1 for saliency and coreference (clusters match if >50% mentions overlap)
  • F1 for relation extraction, matching predicted tuples to gold using cluster mapping
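The >50% mention-overlap criterion for cluster matching can be sketched as below. This is a minimal illustration, with overlap measured relative to the predicted cluster's size; the official evaluation scripts in the SciREX repository are authoritative.

```python
def clusters_match(pred: set, gold: set) -> bool:
    """A predicted cluster matches a gold cluster if more than half of
    the predicted mentions appear in the gold cluster."""
    return len(pred & gold) / len(pred) > 0.5

def coref_f1(predicted, gold):
    """Binary F1 over clusters under the >50% mention-overlap criterion."""
    if not predicted or not gold:
        return 0.0
    tp = sum(any(clusters_match(p, g) for g in gold) for p in predicted)
    precision = tp / len(predicted)
    recall = sum(any(clusters_match(p, g) for p in predicted) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, a predicted cluster {1, 2, 3} matches gold {1, 2, 4} (overlap 2/3 > 0.5), while a spurious predicted cluster drags precision down.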

Strong inter-annotator agreement contrasts with model F1, which remains low on the downstream tasks, especially 4-ary relation extraction (F1 = 0.008 with predicted inputs, rising to 0.27 with gold clusters), signifying persistent challenges in document-level aggregation and error propagation (Jain et al., 2020). The baseline nevertheless surpasses sentence- or paragraph-limited models (e.g., DyGIE++), underlining the difficulty of the document-level setting.

5. Subsequent Benchmarking and Model Enhancements

SciREX serves as the principal benchmark for later document-level IE advances:

  • Template-based Generative Models: TempGen (Huang et al., 2021) reformulates N-ary relation extraction as direct template generation via BART, introducing cross-attention TopK Copy to improve mention copying and structural fidelity, achieving F1=3.55 for 4-ary relations (vs. baseline 0.8). This mechanism robustly addresses attention head noise.
  • Iterative Extraction via Imitation Learning: ITERX (Chen et al., 2022) introduces a Markov decision process and imitation learning, extracting templates one at a time using a dynamic oracle and memory-augmented policy, achieving F1=16.9 for CEAF-RME on 4-ary extractions. This strategy outperforms prior seq2seq and pipeline models, particularly in handling multiple/zero templates and template alignment.

These results reinforce the observation that full-document IE poses persistent modeling challenges, with main bottlenecks in cluster saliency and cross-section reasoning.

6. Comparative Context and Impact

SciREX remains the only resource—prior to SciER (Zhang et al., 28 Oct 2024)—annotating full-length scientific papers for N-ary relations and salient entity clusters. However, SciREX does not annotate explicit directed relations among all pairs or provide a fine-grained relation taxonomy, focusing instead on salient result tuples and the cross-document coreference/grounding task.

In comparison, the SciER dataset (Zhang et al., 28 Oct 2024) offers full-text annotation for both entities and a nine-type relation schema (e.g., TRAINED-WITH, EVALUATED-WITH), thus broadening the set of extractable knowledge. SciER, however, is smaller in document count but more granular at the relational level. A plausible implication is that future benchmarking may shift toward end-to-end knowledge graph construction, with SciREX providing the pivotal bridge from abstract-level to full-document, entity/relation-rich resources.

7. Availability and Research Usage

SciREX data and baseline code are fully accessible at https://github.com/allenai/SciREX, supporting direct comparison of systems as well as further annotation or domain extension. Detailed annotation guidelines and preprocessing scripts are likewise provided, ensuring reproducibility and extensibility for research in large-scale, document-level scientific information extraction (Jain et al., 2020).


References:

  • SciREX: A Challenge Dataset for Document-Level Information Extraction (Jain et al., 2020)
  • Iterative Document-level Information Extraction via Imitation Learning (Chen et al., 2022)
  • Document-level Entity-based Extraction as Template Generation (Huang et al., 2021)
  • SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents (Zhang et al., 28 Oct 2024)
