
Arabic Fine-Grained Entity Recognition (2310.17333v2)

Published 26 Oct 2023 in cs.CL

Abstract: Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood were manually annotated with the LDC's ACE subtypes. We refer to this extended version of Wojood as WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute the baselines of WojoodFine, we fine-tuned three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER, and nested NER with subtypes, achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.
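
The two agreement measures named in the abstract, Cohen's Kappa and F1, can be illustrated with a minimal sketch. The abstract does not describe how the authors computed them, so this is not their procedure. The sketch assumes token-level labels from two annotators, treats annotator A as the reference for F1, and ignores the "O" (no entity) label when counting matches. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_f1(labels_a, labels_b, outside="O"):
    """Micro F1 with annotator A as reference, ignoring the outside label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be equal in length")
    matches = sum(a == b and a != outside for a, b in zip(labels_a, labels_b))
    reference_total = sum(a != outside for a in labels_a)
    candidate_total = sum(b != outside for b in labels_b)
    if matches == 0 or reference_total == 0 or candidate_total == 0:
        return 0.0
    precision = matches / candidate_total
    recall = matches / reference_total
    return 2 * precision * recall / (precision + recall)


# Toy, made-up labels (not taken from the corpus).
annotator_a = ["GPE", "GPE", "ORG", "O", "LOC", "O", "ORG", "FAC"]
annotator_b = ["GPE", "GPE", "ORG", "O", "LOC", "O", "FAC", "FAC"]

print(f"kappa = {cohen_kappa(annotator_a, annotator_b):.4f}")
print(f"F1    = {agreement_f1(annotator_a, annotator_b):.4f}")
```

Kappa discounts the agreement expected by chance, whereas F1 does not, so reporting both gives a fuller picture when the label distribution is skewed.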
