RED$^{\rm FM}$: a Filtered and Multilingual Relation Extraction Dataset (2306.09802v2)

Published 16 Jun 2023 in cs.CL

Abstract: Relation Extraction (RE) is a task that identifies relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address the above issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SRED$^{\rm FM}$, an automatically annotated dataset covering 18 languages, 400 relation types, 13 entity types, totaling more than 40 million triplet instances. Second, we propose RED$^{\rm FM}$, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, that extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at https://www.github.com/babelscape/rebel

Overview of "RED<sup>FM</sup>: a Filtered and Multilingual Relation Extraction Dataset"

The paper introduces two notable resources aimed at advancing the field of Relation Extraction (RE): SRED<sup>FM</sup> and RED<sup>FM</sup>. These datasets represent a significant stride towards enhancing multilingual capabilities in RE systems, addressing the often-cited limitation of existing datasets that predominantly focus on English and offer limited relation types.

Dataset Contributions

  1. SRED<sup>FM</sup> Dataset:
    • Automatically annotated and extensive, covering 18 languages and 400 relation types.
    • Provides a total of over 40 million triplet instances along with 13 entity types.
    • Designed to facilitate the training and evaluation of large-scale multilingual RE systems.
  2. RED<sup>FM</sup> Dataset:
    • A curated, human-revised dataset that provides high-quality annotations across seven languages.
    • Comprises 32 relation types, enabling rigorous evaluation with precise multilingual RE benchmarks.
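
Concretely, each instance in these datasets pairs a sentence with a typed triplet. The sketch below shows one plausible record layout and a simple filter by relation type; the field names and entity-type labels are illustrative assumptions, not the datasets' actual schema.

```python
from dataclasses import dataclass

# Illustrative record layout for one triplet instance; the real
# SREDFM/REDFM schema may differ (field names here are assumptions).
@dataclass
class TripletInstance:
    text: str        # source sentence
    lang: str        # language code (one of the 18 covered languages)
    head: str        # subject entity mention
    head_type: str   # one of the 13 entity types
    relation: str    # one of the 400 relation types
    tail: str        # object entity mention
    tail_type: str

def filter_by_relation(instances, relation):
    """Keep only instances expressing the given relation type."""
    return [inst for inst in instances if inst.relation == relation]

sample = [
    TripletInstance("Rome is the capital of Italy.", "en",
                    "Rome", "LOC", "capital of", "Italy", "LOC"),
    TripletInstance("Dante wrote the Divine Comedy.", "en",
                    "Dante", "PER", "author", "Divine Comedy", "WORK"),
]
capitals = filter_by_relation(sample, "capital of")
```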

Methodology

  • End-to-End Relation Extraction: A shift from traditional modular approaches to an end-to-end model that aims to minimize error propagation typical of sequential NER and Relation Classification steps.
  • Triplet Critic System: A novel cross-encoder designed to filter out erroneous annotations using human-labelled data, ensuring that the automatically annotated dataset maintains a high level of accuracy.
  • Entity Typing: Employs a Transformer-based NER classifier that refines entity type assignments by leveraging relational information in BabelNet, improving the quality of annotations and facilitating strict evaluation metrics in RE tasks.
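
The Triplet Critic's role can be pictured as a filter that scores each (sentence, triplet) pair and discards low-confidence extractions. In the sketch below the scorer is a toy stand-in for the paper's trained cross-encoder, and the threshold is an assumed value chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple

def filter_triplets(
    pairs: Iterable[Tuple[str, str]],
    scorer: Callable[[str, str], float],
    threshold: float = 0.75,   # assumed cutoff; the paper tunes its own
) -> List[Tuple[str, str]]:
    """Keep (sentence, triplet) pairs the critic scores above threshold.

    In the paper, the critic is a trained cross-encoder that jointly
    encodes the sentence and a linearized triplet; here any callable
    returning a confidence in [0, 1] will do.
    """
    return [(s, t) for s, t in pairs if scorer(s, t) >= threshold]

# Toy stand-in scorer: trusts a triplet only if both its head and tail
# appear verbatim in the sentence (a real critic learns far more).
def toy_scorer(sentence: str, triplet: str) -> float:
    head, _, tail = triplet.split(" | ")
    return 1.0 if head in sentence and tail in sentence else 0.0

pairs = [
    ("Rome is the capital of Italy.", "Rome | capital of | Italy"),
    ("Rome is the capital of Italy.", "Rome | capital of | France"),
]
kept = filter_triplets(pairs, toy_scorer)
```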

Model and Evaluation

  • mREBEL: The first end-to-end multilingual RE model, capable of extracting entity types and relations across multiple languages.
  • Performance: Evaluation on both new datasets (SRED<sup>FM</sup> and RED<sup>FM</sup>) and on existing benchmarks such as SMiLER shows that mREBEL outperforms several competitive baselines.
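
Because mREBEL is a seq2seq model, extraction reduces to generating a linearized string of triplets and parsing it back into structured form. The parser below assumes a REBEL-style linearization with illustrative marker tokens; the released checkpoints may use different special tokens and ordering.

```python
import re

def parse_linearized(output: str):
    """Parse a REBEL-style linearized generation into typed triplets.

    Assumed format (illustrative): each triplet is introduced by a
    <triplet> marker, with the head mention followed by its entity
    type in angle brackets, then the tail mention and its type, then
    the relation name.
    """
    triplets = []
    for chunk in output.split("<triplet>")[1:]:
        m = re.match(
            r"\s*(.+?)\s*<(\w+)>\s*(.+?)\s*<(\w+)>\s*(.+?)\s*$", chunk
        )
        if m:
            head, head_type, tail, tail_type, relation = m.groups()
            triplets.append((head, head_type, relation, tail, tail_type))
    return triplets

generated = (
    "<triplet> Rome <loc> Italy <loc> capital of"
    "<triplet> Dante <per> Divine Comedy <work> author"
)
parsed = parse_linearized(generated)
```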

Implications and Future Directions

The presented datasets, along with the mREBEL model, contribute to RE by offering resources and tools for comprehensive multilingual processing. These advancements hold the potential to:

  • Enhance cross-lingual knowledge extraction and integration with Knowledge Graphs.
  • Enable broader applications in multilingual NLP tasks, where understanding relationships in diverse languages is crucial.

Future research could explore further refining entity typing methodologies and assessing the impact of more complex relation types and entities on system performance. Additionally, enhancing zero-shot capabilities on unseen languages using these resources presents another avenue for exploration.

In conclusion, the paper addresses a key deficiency in multilingual RE resources through robust dataset creation and model development. It serves as a stepping stone for continued advancement in accurately and effectively extracting relational information across languages.

References (30)
  1. Elisa Bassignana and Barbara Plank. 2022. What do you mean by relation extraction? a survey on datasets and study on scientific relation classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 67–83, Dublin, Ireland. Association for Computational Linguistics.
  2. Question answering systems: Survey and trends. Procedia Computer Science, 73:366–375. International Conference on Advanced Wireless Information and Communication Technologies (AWICT 2015).
  3. Multilingual relation classification via efficient and effective prompting. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Online and Abu Dhabi, the United Arab Emirates. Association for Computational Linguistics.
  4. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  5. Automatic text summarization: A comprehensive survey. Expert Systems with Applications, 165:113679.
  6. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  7. Beyond English-centric multilingual machine translation.
  8. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing.
  9. Pere-Lluís Huguet Cabot and Roberto Navigli. 2021. REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2370–2381, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  10. George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
  11. Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869, Doha, Qatar. Association for Computational Linguistics.
  12. David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.
  13. Ten years of BabelNet: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4559–4567.
  14. Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.
  15. Structured prediction as translation between augmented natural languages. In 9th International Conference on Learning Representations, ICLR 2021.
  16. End-to-end relation extraction using neural networks and Markov Logic Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 818–827, Valencia, Spain. Association for Computational Linguistics.
  17. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.
  18. Multilingual entity and relation extraction dataset and model. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1946–1955, Online. Association for Computational Linguistics.
  19. Re-TACRED: Addressing shortcomings of the TACRED dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15):13843–13850.
  20. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation.
  21. Let’s Stop Incorrect Comparisons in End-to-end Relation Extraction! In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3689–3701, Online. Association for Computational Linguistics.
  22. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3450–3466, Online. Association for Computational Linguistics.
  23. Named Entity Recognition for Entity Linking: What works and what’s next. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2584–2596, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  24. WikiNEuRal: Combined neural and knowledge-based silver data creation for multilingual NER. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2521–2533, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  25. Simone Tedeschi and Roberto Navigli. 2022. MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition (and disambiguation). In Findings of the Association for Computational Linguistics: NAACL 2022, pages 801–812, Seattle, United States. Association for Computational Linguistics.
  26. Jue Wang and Wei Lu. 2020. Two are better than one: Joint entity and relation extraction with table-sequence encoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1706–1721, Online. Association for Computational Linguistics.
  27. Symbolic knowledge distillation: from general language models to commonsense models.
  28. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.
  29. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational Linguistics.
  30. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.