GenIE: Generative Information Extraction (2112.08340v3)

Published 15 Dec 2021 in cs.CL, cs.LG, and stat.ML

Abstract: Structured and grounded representation of text is typically formalized by closed information extraction, the problem of extracting an exhaustive set of (subject, relation, object) triplets that are consistent with a predefined set of entities and relations from a knowledge base schema. Most existing works are pipelines prone to error accumulation, and all approaches are only applicable to unrealistically small numbers of entities and relations. We introduce GenIE (generative information extraction), the first end-to-end autoregressive formulation of closed information extraction. GenIE naturally exploits the language knowledge from the pre-trained transformer by autoregressively generating relations and entities in textual form. Thanks to a new bi-level constrained generation strategy, only triplets consistent with the predefined knowledge base schema are produced. Our experiments show that GenIE is state-of-the-art on closed information extraction, generalizes from fewer training data points than baselines, and scales to a previously unmanageable number of entities and relations. With this work, closed information extraction becomes practical in realistic scenarios, providing new opportunities for downstream tasks. Finally, this work paves the way towards a unified end-to-end approach to the core tasks of information extraction. Code, data and models available at https://github.com/epfl-dlab/GenIE.

Citations (59)

Summary

  • The paper presents GenIE, an end-to-end autoregressive model that converts unstructured text into structured (subject, relation, object) triplets using a bi-level constrained generation approach.
  • The model outperforms existing systems such as SetGenNet, mitigating the error propagation of pipeline approaches and generalizing well from few training examples, even for large-scale knowledge bases.
  • The paper highlights future directions such as multilingual applications and unified extraction tasks, paving the way for broader practical use in knowledge graph population and reasoning.

GenIE: Generative Information Extraction

The paper "GenIE: Generative Information Extraction" introduces a novel approach to closed information extraction (cIE) through an autoregressive model called GenIE. Traditional cIE methods have primarily relied on pipeline architectures that chain separate tasks such as named entity recognition (NER), entity linking (EL), and relation classification (RC). These pipelines are prone to error propagation and have remained feasible only for small knowledge base (KB) schemas due to scalability constraints.
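To make the task concrete, a generative formulation turns a set of triplets into a single target string that a sequence-to-sequence model can emit. The sketch below is a toy illustration of such a linearization; the marker tokens (`<sub>`, `<rel>`, `<obj>`, `<et>`) are an assumption for illustration, not necessarily the paper's exact tokenization.

```python
# Hypothetical linearization of a triplet set into one target sequence,
# in the spirit of generative closed IE. Marker tokens are illustrative.
triplets = [
    ("Barack Obama", "place of birth", "Honolulu"),
    ("Barack Obama", "spouse", "Michelle Obama"),
]

def linearize(triplets):
    """Join (subject, relation, object) triplets into a single marked-up string."""
    return " ".join(
        f"<sub> {s} <rel> {r} <obj> {o} <et>" for s, r, o in triplets
    )

print(linearize(triplets))
```

Decoding then reduces to generating this string left to right; the constrained-generation strategy described next ensures the entity and relation spans are valid schema elements rather than free text.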

Constrained Autoregressive Approach

The core contribution of this paper is the introduction of GenIE, the first end-to-end autoregressive model specifically designed for closed information extraction. GenIE leverages a sequence-to-sequence transformer, in this case, the BART model, to translate unstructured text into structured (subject, relation, object) triplets. Through a novel bi-level constrained generation strategy, GenIE ensures that only valid triplets, compliant with a predefined KB schema, are generated.
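One common way to realize such constraints, and plausibly what "bi-level" alludes to, is to restrict the decoder's next-token choices: one level enforces the triplet structure, while the other restricts entity and relation spans to catalog members, typically via a prefix trie over tokenized names. The sketch below shows only the trie mechanism on a toy whitespace-tokenized schema; the catalogs and tokenization are assumptions, not the paper's implementation.

```python
class Trie:
    """Prefix trie over token sequences; restricts decoding to valid catalog names."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})
            node["<end>"] = {}  # marks that the prefix so far is a complete name

    def allowed_next(self, prefix):
        """Return the set of tokens that may legally follow `prefix`."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()  # prefix is not in the catalog at all
            node = node[tok]
        return set(node)

# Toy catalogs, tokenized by whitespace (illustrative, not the real KB schema).
entity_trie = Trie([["Barack", "Obama"], ["Barack", "Obama", "Sr."], ["Honolulu"]])
relation_trie = Trie([["place", "of", "birth"], ["place", "of", "death"]])

# After emitting "Barack Obama", the decoder may stop ("<end>") or continue to "Sr.".
print(entity_trie.allowed_next(["Barack", "Obama"]))
```

At each decoding step, the model's vocabulary distribution would be masked to `allowed_next(...)` of the current span, so only schema-consistent triplets can ever be produced.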

Numerical Performance and Scalability

Empirically, GenIE sets a new state of the art for cIE, outperforming previous models such as Set Generation Networks (SetGenNet). It generalizes from fewer training data points and scales robustly to millions of entities and hundreds of relations, demonstrating practical viability in realistic settings. Specifically, experiments show that GenIE maintains competitive performance even on large schemas derived from Wikidata, with approximately 6 million entities and 857 relations.

Comparison with Prior Models

The study conducted an extensive evaluation, comparing GenIE against multiple baselines, including traditional pipeline architectures and SetGenNet, an end-to-end model for relation extraction. The results show that GenIE largely avoids the error propagation of pipeline systems and excels in low-data regimes, especially for rarely occurring relations.

Implications and Future Directions

The integration of a constrained autoregressive model presents substantial opportunities for practical applications in AI, particularly in tasks like knowledge graph population and maintenance, symbolic representation, and reasoning. GenIE's formulation also opens the door for unifying a spectrum of related tasks, from entity linking to slot filling, under a single autoregressive framework.

Moving forward, there are promising avenues for expansion, such as multilingual applications and optimizing the computational footprint of autoregressive models. Moreover, the approach could bridge the gap between open information extraction (oIE) and closed information extraction by accommodating literals in entity positions, thereby enhancing its applicability in real-world scenarios. Integrating such models into multilingual settings, as evidenced by efforts like mGENRE in entity linking, could further extend the reach and utility of GenIE.

In conclusion, the paper provides a comprehensive analysis and a substantial advancement in the domain of information extraction, notably scaling the ability to work with realistic and complex knowledge base schemas while setting a foundation for unified models addressing various information extraction tasks.
