ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata (2405.09496v1)
Abstract: We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
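The mapping from a complex type hierarchy to a standard type (PER/LOC/ORG) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy hierarchy `SUBCLASS_OF`, the root-to-label table `ROOT_TO_LABEL`, and the function `standard_type` are all hypothetical names chosen for illustration, and the type names are placeholders rather than real Wikidata identifiers. The sketch assumes that each entity lists one or more types and that a type receives a standard label if one of the three assumed root types is reachable by following subclass links upward.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical toy "subclass of" hierarchy (child -> parents).
# These are placeholder names, not real Wikidata data.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "city": {"settlement"},
    "settlement": {"geographic_region"},
    "university": {"organization"},
    "human": set(),
    "geographic_region": set(),
    "organization": set(),
}

# Assumed root types for the three standard labels.
ROOT_TO_LABEL: Dict[str, str] = {
    "human": "PER",
    "geographic_region": "LOC",
    "organization": "ORG",
}


def standard_type(entity_types: Iterable[str]) -> Optional[str]:
    """Walk up the hierarchy from each type and return the first
    standard label reached, or None if no root type is reachable."""
    stack = list(entity_types)
    seen: Set[str] = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated work
        seen.add(current)
        if current in ROOT_TO_LABEL:
            return ROOT_TO_LABEL[current]
        stack.extend(SUBCLASS_OF.get(current, ()))
    return None


if __name__ == "__main__":
    for types in (["city"], ["university"], ["human"], ["unknown"]):
        print(types, "->", standard_type(types))
```

An entity whose types reach more than one root would need an explicit disambiguation policy; this sketch simply returns the first label it finds, which is a simplification.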