WebIE: Faithful and Robust Information Extraction on the Web (2305.14293v2)
Abstract: Extracting structured and grounded fact triples from raw text is a fundamental task in Information Extraction (IE). Existing IE datasets are typically collected from Wikipedia articles, using hyperlinks to link entities to the Wikidata knowledge base. However, models trained only on Wikipedia have limitations when applied to web domains, which often contain noisy text or text without any factual information. We present WebIE, the first large-scale, entity-linked closed IE dataset, consisting of 1.6M sentences automatically collected from the English Common Crawl corpus. WebIE also includes negative examples, i.e., sentences without fact triples, to better reflect data on the web. We annotate ~21K triples from WebIE through crowdsourcing and introduce mWebIE, a translation of the annotated set into four other languages: French, Spanish, Portuguese, and Hindi. We evaluate the in-domain, out-of-domain, and zero-shot cross-lingual performance of generative IE models and find that models trained on WebIE generalise better. We also propose three training strategies that use entity linking as an auxiliary task. Our experiments show that adding entity-linking objectives improves the faithfulness of our generative IE models.
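To make the task concrete, here is a minimal sketch (not the paper's code) of what an entity-linked closed-IE output looks like: a sentence maps to zero or more (subject, relation, object) triples whose elements are Wikidata identifiers, and a "negative example" maps to an empty list. The toy `extract_triples` function and the example sentence are hypothetical; the Wikidata IDs used (Q597 Lisbon, P1376 "capital of", Q45 Portugal) are real.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactTriple:
    subject_qid: str   # Wikidata entity ID of the subject
    relation_pid: str  # Wikidata property ID of the relation
    object_qid: str    # Wikidata entity ID of the object

def extract_triples(sentence: str) -> list[FactTriple]:
    """Toy stand-in for a generative IE model: hand-written output for
    one example sentence, empty otherwise (mirroring WebIE's negative
    examples, i.e. web sentences carrying no factual information)."""
    if sentence == "Lisbon is the capital of Portugal.":
        # (Lisbon, capital of, Portugal) -> (Q597, P1376, Q45)
        return [FactTriple("Q597", "P1376", "Q45")]
    return []  # negative example: no fact triples

print(extract_triples("Lisbon is the capital of Portugal."))
print(extract_triples("Click here to subscribe to our newsletter!"))
```

Grounding each triple in knowledge-base identifiers rather than surface strings is what distinguishes closed IE from open IE, and is why the paper's entity-linking auxiliary objectives can directly improve faithfulness.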
Authors: Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, Christos Christodoulopoulos, Andrea Pierleoni