RED$^{\rm FM}$: a Filtered and Multilingual Relation Extraction Dataset (2306.09802v2)

Published 16 Jun 2023 in cs.CL

Abstract: Relation Extraction (RE) is a task that identifies relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address the above issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SRED$^{\rm FM}$, an automatically annotated dataset covering 18 languages, 400 relation types, 13 entity types, totaling more than 40 million triplet instances. Second, we propose RED$^{\rm FM}$, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, that extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at https://www.github.com/babelscape/rebel

Overview of "RED<sup>FM</sup>: a Filtered and Multilingual Relation Extraction Dataset"

The paper introduces two notable resources aimed at advancing the field of Relation Extraction (RE): SRED<sup>FM</sup> and RED<sup>FM</sup>. These datasets represent a significant stride towards enhancing multilingual capabilities in RE systems, addressing the often-cited limitation of existing datasets that predominantly focus on English and offer limited relation types.

Dataset Contributions

  1. SRED<sup>FM</sup> Dataset:
    • Automatically annotated and extensive, covering 18 languages and 400 relation types.
    • Provides a total of over 40 million triplet instances along with 13 entity types.
    • Designed to facilitate the training and evaluation of large-scale multilingual RE systems.
  2. RED<sup>FM</sup> Dataset:
    • A curated, human-revised dataset that provides high-quality annotations across seven languages.
    • Comprises 32 relation types, enabling rigorous evaluation with precise multilingual RE benchmarks.
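
Concretely, each instance in these datasets pairs a sentence with a typed triplet. The sketch below shows one plausible record layout and a simple filter by relation type; the field names and entity-type labels are illustrative assumptions, not the datasets' actual schema.

```python
from dataclasses import dataclass

# Illustrative record layout for one triplet instance; the real
# SREDFM/REDFM schema may differ (field names here are assumptions).
@dataclass
class TripletInstance:
    text: str        # source sentence
    lang: str        # language code (one of the 18 covered languages)
    head: str        # subject entity mention
    head_type: str   # one of the 13 entity types
    relation: str    # one of the 400 relation types
    tail: str        # object entity mention
    tail_type: str

def filter_by_relation(instances, relation):
    """Keep only instances expressing the given relation type."""
    return [inst for inst in instances if inst.relation == relation]

sample = [
    TripletInstance("Rome is the capital of Italy.", "en",
                    "Rome", "LOC", "capital of", "Italy", "LOC"),
    TripletInstance("Dante wrote the Divine Comedy.", "en",
                    "Dante", "PER", "author", "Divine Comedy", "WORK"),
]
capitals = filter_by_relation(sample, "capital of")
```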

Methodology

  • End-to-End Relation Extraction: A shift from traditional modular approaches to an end-to-end model that aims to minimize error propagation typical of sequential NER and Relation Classification steps.
  • Triplet Critic System: A novel cross-encoder designed to filter out erroneous annotations using human-labelled data, ensuring that the automatically annotated dataset maintains a high level of accuracy.
  • Entity Typing: Employs a Transformer-based NER classifier that refines entity type assignments by leveraging relational information in BabelNet, improving the quality of annotations and facilitating strict evaluation metrics in RE tasks.
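
The Triplet Critic's role can be pictured as a filter that scores each (sentence, triplet) pair and discards low-confidence extractions. In the sketch below the scorer is a toy stand-in for the paper's trained cross-encoder, and the threshold is an assumed value chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple

def filter_triplets(
    pairs: Iterable[Tuple[str, str]],
    scorer: Callable[[str, str], float],
    threshold: float = 0.75,   # assumed cutoff; the paper tunes its own
) -> List[Tuple[str, str]]:
    """Keep (sentence, triplet) pairs the critic scores above threshold.

    In the paper, the critic is a trained cross-encoder that jointly
    encodes the sentence and a linearized triplet; here any callable
    returning a confidence in [0, 1] will do.
    """
    return [(s, t) for s, t in pairs if scorer(s, t) >= threshold]

# Toy stand-in scorer: trusts a triplet only if both its head and tail
# appear verbatim in the sentence (a real critic learns far more).
def toy_scorer(sentence: str, triplet: str) -> float:
    head, _, tail = triplet.split(" | ")
    return 1.0 if head in sentence and tail in sentence else 0.0

pairs = [
    ("Rome is the capital of Italy.", "Rome | capital of | Italy"),
    ("Rome is the capital of Italy.", "Rome | capital of | France"),
]
kept = filter_triplets(pairs, toy_scorer)
```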

Model and Evaluation

  • mREBEL: The first end-to-end multilingual RE model, capable of extracting entity types and relations across multiple languages.
  • Performance: Evaluation on both new datasets (SRED<sup>FM</sup> and RED<sup>FM</sup>) and on existing benchmarks such as SMiLER shows that mREBEL outperforms several competitive baselines.
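
Because mREBEL is a seq2seq model, extraction reduces to generating a linearized string of triplets and parsing it back into structured form. The parser below assumes a REBEL-style linearization with illustrative marker tokens; the released checkpoints may use different special tokens and ordering.

```python
import re

def parse_linearized(output: str):
    """Parse a REBEL-style linearized generation into typed triplets.

    Assumed format (illustrative): each triplet is introduced by a
    <triplet> marker, with the head mention followed by its entity
    type in angle brackets, then the tail mention and its type, then
    the relation name.
    """
    triplets = []
    for chunk in output.split("<triplet>")[1:]:
        m = re.match(
            r"\s*(.+?)\s*<(\w+)>\s*(.+?)\s*<(\w+)>\s*(.+?)\s*$", chunk
        )
        if m:
            head, head_type, tail, tail_type, relation = m.groups()
            triplets.append((head, head_type, relation, tail, tail_type))
    return triplets

generated = (
    "<triplet> Rome <loc> Italy <loc> capital of"
    "<triplet> Dante <per> Divine Comedy <work> author"
)
parsed = parse_linearized(generated)
```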

Implications and Future Directions

The presented datasets, along with the mREBEL model, contribute to RE by offering resources and tools for comprehensive multilingual processing. These advancements hold the potential to:

  • Enhance cross-lingual knowledge extraction and integration with Knowledge Graphs.
  • Enable broader applications in multilingual NLP tasks, where understanding relationships in diverse languages is crucial.

Future research could explore further refining entity typing methodologies and assessing the impact of more complex relation types and entities on system performance. Additionally, enhancing zero-shot capabilities on unseen languages using these resources presents another avenue for exploration.

In conclusion, the paper addresses a key deficiency in multilingual RE resources through robust dataset creation and model development. It serves as a stepping stone for continued advancement in accurately and effectively extracting relational information across languages.

References (30)
  1. Elisa Bassignana and Barbara Plank. 2022. What do you mean by relation extraction? a survey on datasets and study on scientific relation classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 67–83, Dublin, Ireland. Association for Computational Linguistics.
  2. Question answering systems: Survey and trends. Procedia Computer Science, 73:366–375. International Conference on Advanced Wireless Information and Communication Technologies (AWICT 2015).
  3. Multilingual relation classification via efficient and effective prompting. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Online and Abu Dhabi, the United Arab Emirates. Association for Computational Linguistics.
  4. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  5. Automatic text summarization: A comprehensive survey. Expert Systems with Applications, 165:113679.
  6. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  7. Beyond English-centric multilingual machine translation.
  8. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing.
  9. Pere-Lluís Huguet Cabot and Roberto Navigli. 2021. REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2370–2381, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  10. George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
  11. Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869, Doha, Qatar. Association for Computational Linguistics.
  12. David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.
  13. Ten years of BabelNet: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4559–4567.
  14. Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.
  15. Structured prediction as translation between augmented natural languages. In 9th International Conference on Learning Representations, ICLR 2021.
  16. End-to-end relation extraction using neural networks and Markov Logic Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 818–827, Valencia, Spain. Association for Computational Linguistics.
  17. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.
  18. Multilingual entity and relation extraction dataset and model. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1946–1955, Online. Association for Computational Linguistics.
  19. Re-TACRED: Addressing shortcomings of the TACRED dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15):13843–13850.
  20. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation.
  21. Let’s Stop Incorrect Comparisons in End-to-end Relation Extraction! In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3689–3701, Online. Association for Computational Linguistics.
  22. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3450–3466, Online. Association for Computational Linguistics.
  23. Named Entity Recognition for Entity Linking: What works and what’s next. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2584–2596, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  24. WikiNEuRal: Combined neural and knowledge-based silver data creation for multilingual NER. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2521–2533, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  25. Simone Tedeschi and Roberto Navigli. 2022. MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition (and disambiguation). In Findings of the Association for Computational Linguistics: NAACL 2022, pages 801–812, Seattle, United States. Association for Computational Linguistics.
  26. Jue Wang and Wei Lu. 2020. Two are better than one: Joint entity and relation extraction with table-sequence encoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1706–1721, Online. Association for Computational Linguistics.
  27. Symbolic knowledge distillation: from general language models to commonsense models.
  28. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.
  29. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational Linguistics.
  30. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.