
Cross-lingual Named Entity Corpus for Slavic Languages (2404.00482v2)

Published 30 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: This paper presents a corpus manually annotated with named entities for six Slavic languages: Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. The work is the result of a series of shared tasks conducted in 2017-2023 as part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5 017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits: single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with pre-trained multilingual models: XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
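
As a rough illustration of the "single topic out" split mentioned in the abstract, the sketch below holds out every document of one topic for tuning and trains on the rest. This is only one plausible reading of the split, not the authors' procedure. The function name `single_topic_out`, the record fields `id`, `lang`, and `topic`, and the topic labels are all hypothetical and do not reflect the corpus's actual file format.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, str]  # hypothetical record layout: {"id", "lang", "topic"}


def single_topic_out(docs: List[Doc], held_out_topic: str) -> Tuple[List[Doc], List[Doc]]:
    """Return (train, tune), where tune holds every document of one topic.

    Assumption: "single topic out" means that one topic is excluded from
    training entirely and used only for tuning/evaluation.
    """
    train = [d for d in docs if d["topic"] != held_out_topic]
    tune = [d for d in docs if d["topic"] == held_out_topic]
    return train, tune


# Toy data with made-up topic labels, for illustration only.
toy_docs = [
    {"id": "d1", "lang": "pl", "topic": "topic_a"},
    {"id": "d2", "lang": "cs", "topic": "topic_a"},
    {"id": "d3", "lang": "bg", "topic": "topic_b"},
    {"id": "d4", "lang": "uk", "topic": "topic_c"},
]

train_docs, tune_docs = single_topic_out(toy_docs, "topic_a")
print(len(train_docs), len(tune_docs))
```

Under this reading, the tuning set tests how well a model generalizes to entities from a topic it never saw during training, which is a different question from a split in which every topic appears on both sides.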

