Document-Level Relation Extraction
- Document-level relation extraction is the task of identifying semantic relationships between entities across entire documents by leveraging inter-sentence context.
- The field integrates heterogeneous graph models, transformer-based encoders, and modular LLM prompting to address challenges such as coreference resolution and multi-label class imbalance.
- Practical applications include automated knowledge base construction and enhanced multi-hop reasoning, crucial for processing large, unstructured texts.
Document-level relation extraction (DocRE) is the task of identifying semantic relationships between entities distributed throughout an entire document, rather than within a single sentence. DocRE is fundamentally more complex than sentence-level relation extraction due to inter-sentential relation expression, the necessity of coreference and multi-hop reasoning, and the prevalence of multi-label cases and class imbalance. The field is tightly coupled with advances in heterogeneous graph modeling, pretrained language models, and structured reasoning frameworks, and is critical for automated knowledge base construction from large, unstructured corpora.
1. Problem Formulation and Core Challenges
Given a document $D$, a set of entities $\mathcal{E} = \{e_1, \ldots, e_n\}$ (each possibly having multiple mentions), and a fixed relation schema $\mathcal{R}$, DocRE aims to extract all triples $(e_h, r, e_t)$ with $e_h, e_t \in \mathcal{E}$ and $r \in \mathcal{R}$ such that the relation $r$ holds between $e_h$ and $e_t$ in the document context. The nature of DocRE induces several key technical challenges:
- Long-distance and cross-sentence inference: Approximately 40% of relation instances in large benchmarks span two or more sentences, requiring information propagation that strictly sentence-level models cannot perform.
- Coreference and anaphora: Entities and arguments may be referenced as pronouns or aliases; resolving such phenomena is central to DocRE performance.
- Multi-label and long-tail distributions: Both multiple concurrent relations per entity pair and rare (low-frequency) classes are prevalent. This fundamentally differs from sentence-level settings (where a single label per pair is typical) and exacerbates class imbalance (a minimal data-structure sketch of the multi-label task interface follows this list).
- Interpretability and evidence localization: Only a small subset of the textual content in a document typically provides factual support for a given relation, complicating both prediction and evaluation (Delaunay et al., 2023, Xie et al., 2021).
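To make the multi-label input/output structure concrete, here is a minimal sketch of the task interface; all names (`Mention`, `Triple`, `extract`) are illustrative and not drawn from any benchmark toolkit:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    entity_id: int   # which entity this mention refers to
    sent_id: int     # index of the sentence containing the mention
    start: int       # token offsets of the mention span
    end: int

@dataclass(frozen=True)
class Triple:
    head: int        # head entity id
    relation: str    # label from the fixed schema
    tail: int        # tail entity id

def extract(document: list[list[str]], mentions: list[Mention]) -> set[Triple]:
    """A DocRE system maps a document (a list of tokenized sentences) plus
    entity mentions to a *set* of triples: one (head, tail) pair may carry
    several relations, and most pairs carry none."""
    raise NotImplementedError  # model-specific
```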
2. Model Architectures and Extraction Paradigms
The architecture landscape in DocRE is diverse, with three dominant branches:
- Heterogeneous Graph Neural Networks (GNNs): These methods construct document-level graphs with mention, entity, and sentence nodes, with edges encoding document structure such as coreference, syntactic dependencies, adjacency, and sentence-to-sentence links. Techniques include R-GCNs, GATs, and dynamic attention over heterogeneous edge types. Mention-centered graph construction (fully connecting mentions) is particularly effective for maximizing multi-hop reasoning capacity (Pan et al., 2021, Liu et al., 2023, Xu et al., 2020); a minimal construction sketch follows this list.
- Transformer-based models: Pretrained language models (PLMs) such as BERT are used to derive contextualized embeddings. Methods such as ATLOP, CorefBERT, SIRE, and DRE-MIR operate on the full document or the entity-pair matrix, sometimes augmented with localized context pooling, coreference-aware pretraining, or masked image reconstruction for multi-hop inference (Delaunay et al., 2023, Zhang et al., 2022).
- Prompting and Modular Extraction with LLMs: The AutoRE system introduces an RHF (Relation–Head–Facts) paradigm, leveraging LLMs with parameter-efficient fine-tuning (QLoRA). Extraction is decomposed into three modular steps (relation detection, head entity identification, and tail entity/fact prediction), each realized as a distinct fine-tuned LLM module (Xue et al., 2024); an illustrative decomposition appears after the summary table below. Modular subtask decomposition with LoRA adapters allows scalable handling of multi-relation, multi-sentence settings.
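To ground the graph-based branch, here is a minimal document-graph construction of the kind these methods describe; the node and edge typing (networkx, `etype` attributes) is illustrative rather than any single paper's exact schema:

```python
import networkx as nx

def build_doc_graph(mentions: list[tuple[int, int]]) -> nx.Graph:
    """Heterogeneous document graph from (entity_id, sent_id) mention
    records: mention, entity, and sentence nodes with typed edges."""
    g = nx.Graph()
    for i, (entity_id, sent_id) in enumerate(mentions):
        # mention-entity edges let coreferent mentions meet at their entity node
        g.add_edge(("mention", i), ("entity", entity_id), etype="coref")
        # mention-sentence edges tie each mention to its local context
        g.add_edge(("mention", i), ("sentence", sent_id), etype="in-sent")
    # adjacent-sentence edges support cross-sentence propagation
    sent_ids = sorted({s for _, s in mentions})
    for a, b in zip(sent_ids, sent_ids[1:]):
        g.add_edge(("sentence", a), ("sentence", b), etype="adjacent")
    # mention-centered construction: fully connect mention nodes
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            g.add_edge(("mention", i), ("mention", j), etype="mention-mention")
    return g
```

An R-GCN-style encoder would then assign one transformation per `etype` and propagate PLM-derived embeddings over this graph before scoring entity pairs.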
A summary table of representative model families and their characteristic features:
| Model Family | Graph/Aux Structures | Strengths |
|---|---|---|
| Heterogeneous GNNs | Mention, entity, sentence nodes | Multi-hop, explicit reasoning, coref modeling |
| Transformers | Full doc, pair-matrix | End-to-end, PLM-powered, scalable |
| Modular LLM prompts | RHF (AutoRE) | Fast adaptation, modular, LLM capabilities |
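As noted above, the RHF decomposition can be paraphrased as three chained generation calls; the prompt wording and the `generate` helper below are hypothetical stand-ins, not AutoRE's released templates or API:

```python
def rhf_extract(document: str, generate) -> list[tuple[str, str, str]]:
    """Paraphrase of the Relation-Head-Facts (RHF) pipeline. `generate` is a
    hypothetical helper that calls the fine-tuned adapter for each step and
    parses the model output into a list of strings."""
    triples = []
    # Step 1: relation detection over the whole document.
    for rel in generate(f"List the relations expressed in:\n{document}"):
        # Step 2: head entity identification for this relation.
        for head in generate(f"For relation '{rel}', list head entities in:\n{document}"):
            # Step 3: fact (tail entity) prediction for this (relation, head) pair.
            tails = generate(
                f"For relation '{rel}' and head '{head}', list tail entities in:\n{document}"
            )
            triples.extend((head, rel, tail) for tail in tails)
    return triples
```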
3. Reasoning Over Document Structure
Advanced DocRE models must capture fine-grained document structure with nuanced reasoning pathways:
- Path-based explicit reasoning: Models extract a diverse set of meta-paths (e.g., intra-sentential, logical/bridge-entity, and coreference paths) and apply discriminative scoring, taking the maximum over candidate reasoning chains for each candidate triple (Xu et al., 2021, Xu et al., 2020). Ablations confirm the importance of path-type diversity and targeted attention on reasoning routes.
- Coreference and anaphora modeling: Graphs augmented with explicit mention-pronoun affinity edges (with learned scores) yield superior cross-sentence performance; noise suppression mechanisms that merge only high-confidence pronoun links further reduce error from ambiguous coreference (Xue et al., 2022, Lu et al., 2023). Anaphor-aware graph construction enables information to propagate through pronouns and definite referents, addressing the critical bottleneck of reference resolution.
- Evidence estimation and multi-view fusion: Methods such as Eider extract minimal “evidence sets” by learning sentence importance for each entity pair, then fuse predictions from the full document and evidence-focused subdocuments to boost robustness (Xie et al., 2021); a fusion sketch follows this list. The SIEF framework applies a sentence-focusing loss that regularizes models to ignore irrelevant sentences, empirically enhancing both generalization and evidence localization (Xu et al., 2022).
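A minimal rendering of the multi-view fusion idea; the blending rule below is a simplification of Eider's actual inference procedure:

```python
import torch

def fuse_views(logits_full: torch.Tensor,
               logits_evidence: torch.Tensor,
               threshold: float = 0.0) -> torch.Tensor:
    """Blend per-relation scores from the full document with scores from an
    evidence-only pseudo-document; predict relations whose blended score
    clears the decision threshold. (Simplified from Eider's inference rule.)"""
    blended = logits_full + logits_evidence  # agreement across views boosts a relation
    return blended > threshold
```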
4. Handling Relation Correlations, Long-Tail, and Multi-Label Phenomena
Multi-label entity pairs and imbalanced (long-tail) relation distributions are central challenges in DocRE.
- Relation correlation modeling: Explicit learning of relation–relation co-occurrence structure via deep relation embeddings or relation-graph reasoning improves the assignment of multiple labels per pair and transfers strength from head (common) to tail (rare) relations. BERT-Correl side-trains coarse-grained and fine-grained co-occurrence predictors over correlation-aware embeddings, yielding pronounced gains in macro-F1 and on multi-label cases. Plug-and-play modules such as LACE use a GAT over a statistical relation co-occurrence graph to augment any base model's relation embedding space (Han et al., 2022, Huang et al., 2023).
- Multi-relation and imbalance optimization: Adaptive thresholding (as in ATLOP and its extensions) and custom loss functions (e.g., harmonic or macro-based) address the non-mutual exclusivity of labels and the severe imbalance between positive and negative samples (Huang et al., 2023, Han et al., 2022); a simplified thresholding loss is sketched after this list. Losses are often combined harmonically with auxiliary correlation or reconstruction losses to boost rare-relation recall.
- Masked image reconstruction and pair-matrix inference: DRE-MIR treats the entity-pair matrix as an “image,” applying masked image reconstruction with transformer-like inference over this matrix, capturing complex relation structure and achieving robust multi-label performance (Zhang et al., 2022).
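As a concrete instance of adaptive thresholding, here is a simplified PyTorch rendering of an ATLOP-style loss, in which a learned threshold (TH) class separates positive from negative relations for each entity pair (details abridged from the original):

```python
import torch
import torch.nn.functional as F

def adaptive_threshold_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Simplified ATLOP-style adaptive-thresholding loss.

    logits: [batch, num_classes] raw scores; class 0 is the threshold (TH) class.
    labels: [batch, num_classes] multi-hot gold relations; column 0 is all zeros.
    """
    th_mask = torch.zeros_like(labels)
    th_mask[:, 0] = 1.0

    # Rank every positive relation above the TH class.
    pos_mask = labels + th_mask                    # positives plus TH
    logits_pos = logits + (pos_mask - 1.0) * 1e30  # mask out everything else
    loss_pos = -(F.log_softmax(logits_pos, dim=-1) * labels).sum(dim=-1)

    # Rank the TH class above every negative relation.
    neg_mask = 1.0 - labels                        # negatives plus TH
    logits_neg = logits + (neg_mask - 1.0) * 1e30
    loss_neg = -F.log_softmax(logits_neg, dim=-1)[:, 0]

    return (loss_pos + loss_neg).mean()
```

At inference time, a relation is predicted for a pair whenever its logit exceeds that pair's TH logit, giving each entity pair its own adaptive decision boundary.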
5. Experimental Benchmarks, Evaluation, and Performance Trends
The canonical benchmarks for DocRE are DocRED (Wikipedia domain, 96 relation types), CDR (biomedical, chemical-disease relations), and GDA (biomedical, gene-disease associations), all with high entity and relation density per document. Standard metrics are micro-F1, Ign F1 (micro-F1 excluding relational facts already seen in training), Intra-F1 (same-sentence), and Inter-F1 (cross-sentence).
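These metrics reduce to set comparisons over predicted and gold triples; the sketch below is a simplified approximation (the official DocRED scorer does more careful per-fact bookkeeping):

```python
def micro_f1(pred: set, gold: set) -> float:
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def ign_f1(pred: set, gold: set, train_facts: set) -> float:
    # Ign F1: discard triples whose relational fact already appears in the
    # training annotations, then score as usual.
    return micro_f1(pred - train_facts, gold - train_facts)
```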
Key recent results (test splits):
| Model | DocRED F1 | CDR F1 | GDA F1 | Notes |
|---|---|---|---|---|
| AutoRE (LLM-QLoRA) | 51.9 | — | — | SOTA, modular LLM |
| BERT+EIDER | 62.5 | 70.6 | 84.5 | Evidence fusion |
| LARSON | 62.8 | 71.6 | 86.0 | Explicit syntax/subsentences |
| DRE-MIR | 62.9 | 76.8 | 86.4 | Masked image+matrix |
| AA (Anaphor-Assisted) | 63.4 | — | — | Anaphoric modeling |
| BERT-Correl | 61.3 | 71.6 | — | Relation correlation |
State-of-the-art models leverage fused graph and PLM architectures, explicit relation correlation and evidence handling, or LLM modularity. Inter-sentence (cross-sentence) F1 remains consistently 10–15 points below intra-sentence, highlighting the sustained challenge of long-distance reasoning (Delaunay et al., 2023, Xue et al., 2024).
6. Advances in Modular LLM-based and Few-shot DocRE
- AutoRE and LLM approaches: The AutoRE system demonstrates that off-the-shelf LLMs fine-tuned via QLoRA adapters on the modular RHF subtasks can achieve state-of-the-art results, significantly exceeding strong baselines such as TAG (Xue et al., 2024). RHF's explicit modularization isolates the data-volume imbalance across subtasks and helps prevent overfitting.
- Few-shot settings: Benchmarks such as FREDo reveal that few-shot DocRE is substantially more challenging than its sentence-level counterpart, mainly due to the extreme NOTA (none-of-the-above) class ratio, multi-label complexity, and the need for adaptive NOTA handling and support-prototype adaptation under domain shift (Popovic et al., 2022); a prototype-based sketch follows this list.
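One way to operationalize adaptive NOTA handling is a similarity threshold against support-set prototypes; the sketch below is a hedged illustration of that general idea, not FREDo's exact method:

```python
import torch

def classify_with_nota(query: torch.Tensor,
                       prototypes: torch.Tensor,
                       nota_threshold: float) -> int:
    """query: [dim] embedding of a candidate entity pair;
    prototypes: [num_relations, dim], one prototype per support-set relation.
    Returns a relation index, or -1 for NOTA (none of the above)."""
    sims = torch.nn.functional.cosine_similarity(prototypes, query.unsqueeze(0))
    best = int(sims.argmax())
    # Predict a relation only if it beats the (tuned or learned) NOTA threshold.
    return best if float(sims[best]) >= nota_threshold else -1
```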
7. Open Issues and Future Directions
Several prominent research avenues persist:
- Generalization to unseen relations and schemas: Models such as AutoRE remain limited to relations seen during training; adaptation to large, open, or evolving schemas and zero-shot relation extraction remain significant open challenges (Xue et al., 2024).
- Scaling of modular subtask templates and instruction-tuning corpora: As relation vocabularies grow, prompt engineering and subtask decomposition must scale without prohibitive overheads.
- Efficient document encoding: PLM and GNN integration remains costly for long documents; efficient document transformers and sparse graph induction are active research areas.
- Stronger integration of world knowledge and reasoning: Techniques to inject external knowledge graphs, compile richer commonsense priors, and perform multi-hop or event-centric reasoning are underexplored (Wang et al., 2022).
- Joint modeling for downstream IE: Cohesive frameworks for NER, coreference, and DocRE in a single end-to-end system may further boost extraction performance.
- Human-level interpretability and evidence support: Models that can return minimal supporting evidence for each extracted relation enhance transparency and practical utility (Xie et al., 2021, Duan et al., 2022).
DocRE research continues to push the boundaries of document understanding, advancing from static graph GNNs and contextual PLMs to modular, fine-tunable LLMs, while targeting ever-increasing demands in scalability, stability, and reasoning depth (Xue et al., 2024, Delaunay et al., 2023).