DaMuEL: A Large Multilingual Dataset for Entity Linking (2306.09288v1)
Abstract: We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA license at https://hdl.handle.net/11234/1-5047.
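The role of the QID as a join key between the two components can be sketched as follows. This is a minimal illustration, not the dataset's actual schema or loader: the record classes, field names, and the `join_by_qid` helper are hypothetical, and the example values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class KBEntity:
    """Language-agnostic knowledge base record (hypothetical layout)."""
    qid: str                                  # Wikidata QID, e.g. "Q1085"
    ne_type: Optional[str] = None             # PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED
    claims: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class LangEntity:
    """Language-specific record for one entity (hypothetical layout)."""
    qid: str
    label: str
    aliases: List[str] = field(default_factory=list)
    description: str = ""


def join_by_qid(
    kb: Iterable[KBEntity], lang_records: Iterable[LangEntity]
) -> Dict[str, Tuple[KBEntity, LangEntity]]:
    """Pair each language-specific record with its knowledge base entry.

    Records whose QID is absent from the knowledge base are skipped.
    """
    kb_index = {entity.qid: entity for entity in kb}
    merged: Dict[str, Tuple[KBEntity, LangEntity]] = {}
    for record in lang_records:
        entity = kb_index.get(record.qid)
        if entity is not None:
            merged[record.qid] = (entity, record)
    return merged


# Illustrative values only; not taken from the dataset files.
kb = [KBEntity(qid="Q1085", ne_type="LOC")]
czech = [LangEntity(qid="Q1085", label="Praha", aliases=["Hlavní město Praha"])]
english = [LangEntity(qid="Q1085", label="Prague")]

for lang, records in (("cs", czech), ("en", english)):
    for qid, (entity, record) in join_by_qid(kb, records).items():
        print(lang, qid, entity.ne_type, record.label)
```

Because the knowledge base is keyed only by QID, the same `kb` index can be reused for every language without duplicating the language-agnostic claims or entity types.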