
DocRED: A Large-Scale Document-Level Relation Extraction Dataset (1906.06127v3)

Published 14 Jun 2019 in cs.CL

Abstract: Multiple entities in a document generally exhibit complex inter-sentence relations, and cannot be well handled by existing relation extraction (RE) methods that typically focus on extracting intra-sentence relations for single entity pairs. In order to accelerate the research on document-level RE, we introduce DocRED, a new dataset constructed from Wikipedia and Wikidata with three features: (1) DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text; (2) DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document; (3) along with the human-annotated data, we also offer large-scale distantly supervised data, which enables DocRED to be adopted for both supervised and weakly supervised scenarios. In order to verify the challenges of document-level RE, we implement recent state-of-the-art methods for RE and conduct a thorough evaluation of these methods on DocRED. Empirical results show that DocRED is challenging for existing RE methods, which indicates that document-level RE remains an open problem and requires further efforts. Based on the detailed analysis on the experiments, we discuss multiple promising directions for future research.

An Insightful Overview of "DocRED: A Large-Scale Document-Level Relation Extraction Dataset"

In the paper titled "DocRED: A Large-Scale Document-Level Relation Extraction Dataset," the authors present a novel dataset aimed at advancing the field of document-level relation extraction (RE). DocRED is derived from Wikipedia and Wikidata, offering a robust framework for addressing complex inter-sentence relations that cannot be effectively managed by traditional sentence-level RE methods.

Key Features of DocRED

DocRED represents a substantial leap in the availability of resources for document-level RE by incorporating several defining features:

  1. Scale and Annotation: It is the largest human-annotated dataset for document-level RE, with more than 56,000 relational facts annotated across more than 5,000 Wikipedia documents. This scale ensures broad coverage of entity relation types.
  2. Multi-Sentence Context: The dataset is specifically designed to require understanding and reasoning across multiple sentences. This feature distinguishes it from sentence-level datasets, where context is limited to individual sentences.
  3. Diverse Data Scenarios: Alongside human annotations, DocRED includes large-scale distantly supervised data, enabling its application in both supervised and weakly supervised learning scenarios.
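To make these features concrete, the released data can be traversed roughly as follows. This is a minimal sketch: the field names (`sents`, `vertexSet`, `labels`, `h`, `t`, `r`, `evidence`) are assumed from the public DocRED release, and the document shown is a hypothetical toy example, not taken from the dataset.

```python
# Hypothetical document in DocRED-style JSON layout (field names assumed
# from the public release): tokenized sentences, entity mention clusters,
# and labeled relational facts between entity indices.
doc = {
    "title": "Example",
    "sents": [["Alice", "founded", "Acme", "."],
              ["The", "company", "is", "based", "in", "Berlin", "."]],
    "vertexSet": [  # one entry per entity; each entity is a list of mentions
        [{"name": "Alice", "sent_id": 0, "pos": [0, 1], "type": "PER"}],
        [{"name": "Acme", "sent_id": 0, "pos": [2, 3], "type": "ORG"},
         {"name": "company", "sent_id": 1, "pos": [1, 2], "type": "ORG"}],
        [{"name": "Berlin", "sent_id": 1, "pos": [5, 6], "type": "LOC"}],
    ],
    "labels": [  # relational facts: head index, tail index, relation id
        {"h": 1, "t": 2, "r": "P159", "evidence": [1]},
    ],
}

# Walk the relational facts and resolve entity indices to surface names.
for fact in doc["labels"]:
    head = doc["vertexSet"][fact["h"]][0]["name"]
    tail = doc["vertexSet"][fact["t"]][0]["name"]
    print(f"{head} --{fact['r']}--> {tail} (evidence: {fact['evidence']})")
```

Note that relations hold between entities (mention clusters), not individual mentions, which is what forces models to resolve coreference across sentences.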

Evaluation and Challenges

The empirical analysis in the paper involved applying state-of-the-art RE models to DocRED, revealing significant challenges:

  • Existing models, which perform well on sentence-level tasks, face difficulties with the intricacies of document-level tasks. The necessity to synthesize information from numerous sentences significantly complicates the extraction process.
  • A salient statistic: 40.7% of relational facts in DocRED can only be extracted by reasoning over multiple sentences, illustrating the dataset's complexity.
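One simplified way to see why a fact can require inter-sentence reasoning: if no single sentence contains mentions of both the head and the tail entity, the relation cannot be extracted sentence-locally. The sketch below uses this criterion on hypothetical data (it is an illustration, not the paper's exact counting procedure, which relies on annotated evidence sentences).

```python
def is_inter_sentence(vertex_set, fact):
    """A fact is intra-sentence only if some sentence contains mentions of
    both the head and the tail entity; otherwise extracting it requires
    inter-sentence reasoning. (Simplified criterion for illustration.)"""
    head_sents = {m["sent_id"] for m in vertex_set[fact["h"]]}
    tail_sents = {m["sent_id"] for m in vertex_set[fact["t"]]}
    return head_sents.isdisjoint(tail_sents)

# Hypothetical entities: "Acme" mentioned in sentences 0 and 1,
# "Berlin" only in sentence 2 -- no sentence mentions both.
vertex_set = [
    [{"name": "Acme", "sent_id": 0}, {"name": "the company", "sent_id": 1}],
    [{"name": "Berlin", "sent_id": 2}],
]
fact = {"h": 0, "t": 1, "r": "P159"}
print(is_inter_sentence(vertex_set, fact))  # True: no shared sentence
```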

Implications for Research and Future Directions

The introduction of DocRED opens multiple avenues for both theoretical exploration and practical advancements in AI:

  • Model Development: Researchers are encouraged to develop models capable of handling the document-level context more effectively. There is a clear indication that current methods do not suffice for tasks requiring deep contextual understanding and reasoning.
  • Reasoning and Comprehension: The dataset underscores the need for systems that can perform logical, coreference, and common-sense reasoning across sentences, marking a step toward more sophisticated natural language understanding.
  • Weakly Supervised Learning: The inclusion of distantly supervised data is an invitation to innovate in noise-tolerant learning techniques, which could refine the application of weak supervision in document-level RE.
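The distantly supervised portion of such a dataset is typically produced by aligning knowledge-base triples with text: any co-occurring entity pair that matches a KB fact is labeled with that relation, with no human verification. The sketch below illustrates this heuristic on hypothetical data (it is not the authors' actual construction pipeline), and also shows why the resulting labels are noisy, motivating the noise-tolerant methods mentioned above.

```python
# Minimal distant-supervision sketch: label every entity pair that
# co-occurs in a document with whatever relation the KB holds for it,
# regardless of whether the text actually expresses that relation.
kb = {("Acme", "Berlin"): "P159"}  # hypothetical Wikidata-style fact

def distant_labels(entities, kb):
    labels = []
    for h in entities:
        for t in entities:
            if h != t and (h, t) in kb:
                labels.append({"h": h, "t": t, "r": kb[(h, t)]})
    return labels

print(distant_labels(["Acme", "Berlin", "Alice"], kb))
# -> [{'h': 'Acme', 't': 'Berlin', 'r': 'P159'}]
```

The label is attached even if the document only mentions both entities incidentally, which is exactly the noise that weakly supervised training must tolerate.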

Conclusion

DocRED contributes significantly to the field by providing a comprehensive dataset that challenges existing methodologies and calls for novel approaches to document-level relation extraction. The dataset not only broadens the scope for evaluating RE models but also opens new research avenues in AI. As researchers engage with it, advances in the understanding and processing of complex narrative contexts can be anticipated, further bridging the gap between human-level and machine-level text comprehension.

Authors (10)
  1. Yuan Yao
  2. Deming Ye
  3. Peng Li
  4. Xu Han
  5. Yankai Lin
  6. Zhenghao Liu
  7. Zhiyuan Liu
  8. Lixin Huang
  9. Jie Zhou
  10. Maosong Sun
Citations (421)