Distant Supervision for Relation Extraction beyond the Sentence Boundary

Published 15 Sep 2016 in cs.CL | (1609.04873v3)

Abstract: The growing demand for structured knowledge has led to great interest in relation extraction, especially in cases with limited supervision. However, existing distant supervision approaches only extract relations expressed in single sentences. In general, cross-sentence relation extraction is under-explored, even in the supervised-learning setting. In this paper, we propose the first approach for applying distant supervision to cross-sentence relation extraction. At the core of our approach is a graph representation that can incorporate both standard dependencies and discourse relations, thus providing a unifying way to model relations within and across sentences. We extract features from multiple paths in this graph, increasing accuracy and robustness when confronted with linguistic variation and analysis error. Experiments on an important extraction task for precision medicine show that our approach can learn an accurate cross-sentence extractor, using only a small existing knowledge base and unlabeled text from biomedical research articles. Compared to the existing distant supervision paradigm, our approach extracted twice as many relations at similar precision, thus demonstrating the prevalence of cross-sentence relations and the promise of our approach.

Summary

The paper introduces DISCREX, a novel graph-based method that integrates intra- and inter-sentence dependencies for relation extraction.
It leverages minimal-span candidate selection and path n-gram features to mitigate parser errors and improve classification accuracy.
Empirical results show about twice as many extracted relations as single-sentence extraction at similar precision, underscoring the need for cross-sentence approaches.

The paper "Distant Supervision for Relation Extraction beyond the Sentence Boundary" by Chris Quirk and Hoifung Poon extends distant supervision to cross-sentence relation extraction. Distant supervision has traditionally been applied only to relations expressed within a single sentence. The proposed approach, DISCREX, also extracts relations that span sentences by using a document-level graph that includes both standard dependencies and discourse relations.

Core Contribution

The paper introduces a novel graph-based representation for relation extraction that accommodates both intra-sentence and inter-sentence dependencies. Specifically, this method integrates various edges representing dependency, adjacency, and discourse relations, such as coreference and rhetorical links, enabling a more comprehensive extraction of relationships that span multiple sentences. This graph-based approach allows for feature extraction from multiple paths, thereby increasing robustness against linguistic variability and parsing errors, which are common in complex texts like biomedical literature.
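The graph idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the edge labels (`adj`, `dep:*`, `next_sent`, `coref`), the input format, and the choice to link the dependency roots of neighbouring sentences are all assumptions made for illustration. The toy parse is hand-written, not parser output.

```python
from collections import defaultdict, deque


def build_document_graph(sentences, dependencies, coreference=()):
    """Build an undirected, edge-labelled graph over (sentence, token) nodes.

    sentences    : list of token lists (hypothetical input format)
    dependencies : per sentence, a list of (head_idx, dep_idx, label);
                   head_idx == -1 marks the sentence root
    coreference  : iterable of ((sent, tok), (sent, tok)) pairs
    """
    graph = defaultdict(list)

    def add_edge(u, v, label):
        graph[u].append((v, label))
        graph[v].append((u, label))

    roots = []
    for s, (tokens, deps) in enumerate(zip(sentences, dependencies)):
        # Adjacency edges between neighbouring words.
        for t in range(len(tokens) - 1):
            add_edge((s, t), (s, t + 1), "adj")
        # Dependency edges inside the sentence.
        for head, dep, label in deps:
            if head < 0:
                roots.append((s, dep))
            else:
                add_edge((s, head), (s, dep), "dep:" + label)
    # Assumed cross-sentence link: connect roots of consecutive sentences.
    for left, right in zip(roots, roots[1:]):
        add_edge(left, right, "next_sent")
    # Discourse edges, here only coreference.
    for u, v in coreference:
        add_edge(u, v, "coref")
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns edge labels along one shortest path, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            labels = []
            while parent[node] is not None:
                node, label = parent[node]
                labels.append(label)
            return labels[::-1]
        for neighbour, label in graph[node]:
            if neighbour not in parent:
                parent[neighbour] = (node, label)
                queue.append(neighbour)
    return None


# Toy two-sentence document with made-up entity names.
sentences = [["DrugX", "was", "tested"], ["It", "inhibits", "GeneY"]]
dependencies = [
    [(-1, 2, "root"), (2, 0, "nsubjpass"), (2, 1, "auxpass")],
    [(-1, 1, "root"), (1, 0, "nsubj"), (1, 2, "dobj")],
]
coreference = [((0, 0), (1, 0))]  # "DrugX" <-> "It"

graph = build_document_graph(sentences, dependencies, coreference)
path = shortest_path(graph, (0, 0), (1, 2))
print(path)
```

Because several edge types connect the same region of text, more than one path usually links the two entities. A single wrong dependency arc then does not cut the entities off from each other, which is the robustness argument made above.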

Methodology

DISCREX employs distant supervision by leveraging a knowledge base (KB) and unlabeled text from biomedical articles. It extracts relation features from multiple paths linking entities within the graph, utilizing path n-gram features to improve generalizability and account for parser errors. The approach also emphasizes minimal-span candidate selection, which significantly enhances classification accuracy by focusing only on the shortest linguistic spans where co-occurring entities are found.
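The three steps named above (minimal-span candidate selection, labelling by knowledge-base lookup, and path n-gram features) can be sketched as follows. This is a simplified reading, not the paper's code. The containment test used for "minimal span", the `max_sentence_gap` parameter, the treatment of every non-KB pair as negative, and all entity names are assumptions for illustration.

```python
from itertools import product


def minimal_span_candidates(mentions_a, mentions_b, max_sentence_gap=2):
    """Pair mentions of two entities, keeping only minimal-span co-occurrences.

    Mentions are (sentence_idx, token_idx) positions. A pair is dropped when
    the text span it covers contains the span of another pair of the same
    two entities (a simplified containment test).
    """
    spans = []
    for a, b in product(mentions_a, mentions_b):
        if abs(a[0] - b[0]) <= max_sentence_gap:
            spans.append((min(a, b), max(a, b), a, b))
    minimal = []
    for lo, hi, a, b in spans:
        contains_smaller = any(
            (lo2, hi2) != (lo, hi) and lo <= lo2 and hi2 <= hi
            for lo2, hi2, _, _ in spans
        )
        if not contains_smaller:
            minimal.append((a, b))
    return minimal


def label_candidates(candidates_by_pair, knowledge_base):
    """Distant supervision: a candidate is positive iff its entity pair is in the KB."""
    return [
        (pair, candidate, pair in knowledge_base)
        for pair, candidates in candidates_by_pair.items()
        for candidate in candidates
    ]


def path_ngrams(path_labels, max_n=3):
    """All contiguous n-grams (n = 1..max_n) over a sequence of path labels."""
    return [
        tuple(path_labels[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(path_labels) - n + 1)
    ]


# Hypothetical mentions of one drug and one gene in a three-sentence window.
drug_mentions = [(0, 0), (2, 1)]
gene_mentions = [(0, 5), (1, 3)]
candidates = minimal_span_candidates(drug_mentions, gene_mentions)

# Hypothetical KB with a single known drug-gene pair.
knowledge_base = {("drugx", "geney")}
labelled = label_candidates(
    {("drugx", "geney"): candidates, ("drugx", "genez"): [((0, 0), (0, 2))]},
    knowledge_base,
)

features = path_ngrams(["dep:nsubj", "next_sent", "dep:dobj"], max_n=2)
print(candidates, labelled, features, sep="\n")
```

In the example, the pair spanning sentences 0 to 1 is discarded because the same two entities already co-occur within sentence 0, so only the tighter co-occurrence becomes a training candidate. Short n-grams over a path still fire when another part of the path is wrong or unseen, which is one way to read the generalization and parser-error claims above.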

Results

Empirical evaluations on drug-gene interaction extraction from PubMed Central articles demonstrate the efficacy of DISCREX. The approach extracted about twice as many relations as single-sentence extraction while maintaining comparable precision. This indicates that a significant proportion of relations in scientific literature is expressed across sentence boundaries, underscoring the need for cross-sentence extraction methods.

Implications and Future Directions

The implications of this research are significant for domains requiring comprehensive information extraction from large corpora, such as precision medicine. The ability to extract relations across sentence boundaries without requiring expensive manually annotated data sets marks a substantial step forward. It improves scalability and applicability to expanding domains of specialized knowledge, where traditional single-sentence methods might falter.

Future research could explore the integration of this cross-sentence framework with additional advanced NLP techniques, such as improved discourse parsing and better entity linking systems optimized for specialized domains. Moreover, extending the approach to incorporate implicit reasoning and handling noise in distant supervision labels could enhance performance further. The potential applications are vast, spanning various fields where relational databases are enriched through natural language sources.

This work makes a strong case for including document-level relationships in relation extraction. Its graph representation and its results on drug-gene interaction extraction suggest a promising direction for both future research and practical knowledge extraction systems.
