
Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language (2306.14866v1)

Published 26 Jun 2023 in cs.CL

Abstract: In this paper we address the scarcity of annotated data for NArabizi, a Romanized form of North African Arabic used mostly on social media, which poses challenges for NLP. We introduce an enriched version of the NArabizi Treebank (Seddah et al., 2020) with three main contributions: the addition of two novel annotation layers (named entity recognition and offensive language detection) and a re-annotation of the tokenization, morpho-syntactic and syntactic layers that ensures annotation consistency. Our experimental results, using different tokenization schemes, showcase the value of our contributions and highlight the impact of working with non-gold tokenization for NER and dependency parsing. To facilitate future research, we make these annotations publicly available. Our enhanced NArabizi Treebank paves the way for creating sophisticated LLMs and NLP tools for this under-represented language.

References (52)
  1. Dziribert: a pre-trained language model for the algerian dialect. arXiv preprint arXiv:2109.12346.
  2. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.
  3. Dina Almanea and Massimo Poesio. 2022. ArMIS - the Arabic misogyny and sexism corpus with annotator subjective disagreements. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2282–2291, Marseille, France. European Language Resources Association.
  4. Ctab: Corpus of tunisian arabizi. This corpus has been developed by the Data Engineering and Semantics Research Unit (DES-Unit), University of Sfax, Tunisia. It has been developed to increase the coverage of Latin Script in the NLP resources for Tunisian. It is included as a part of the Tunisian Arabic Corpus (http://www.tunisiya.org/).
  5. Addressing code-switching in french/algerian arabic speech. In Interspeech 2017, pages 62–66.
  6. Dan Bareket and Reut Tsarfaty. 2021. Neural modeling for named entities and morphology (NEMO2). Transactions of the Association for Computational Linguistics, 9:909–928.
  7. Valerio Basile et al. 2020. It’s the end of the gold standard as we know it. on the impact of pre-aggregation on the evaluation of highly subjective tasks. In CEUR WORKSHOP PROCEEDINGS, volume 2776, pages 31–40. CEUR-WS.
  8. Victoria Bobicev and Marina Sokolova. 2017. Inter-annotator agreement in sentiment analysis: Machine learning perspective. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 97–102, Varna, Bulgaria. INCOMA Ltd.
  9. On detecting errors in dependency treebanks. Research on Language and Computation, 6(2):113–137.
  10. Nusacrowd: Open source initiative for indonesian nlp resources.
  11. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
  12. Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
  13. An Algerian Arabic-French code-switched corpus. In Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools Workshop Programme, page 34.
  14. Universal Dependencies. Computational Linguistics, 47(2):255–308.
  15. Broad Twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1169–1179, Osaka, Japan. The COLING 2016 Organizing Committee.
  16. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.
  17. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  18. Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the 2013 conference of the North American Chapter of the association for computational linguistics: Human language technologies, pages 359–369.
  19. CharacterBERT: Reconciling ELMo and BERT for word-level open-vocabulary representations from characters. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6903–6915, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  20. Jennifer Foster. 2010. “cba to check the spelling”: Investigating parser performance on discussion forum posts. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 381–384, Los Angeles, California. Association for Computational Linguistics.
  21. From news to comment: Resources and benchmarks for parsing the language of web 2.0. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 893–901, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.
  22. Bruno Guillaume. 2021. Graph matching and graph rewriting: Grew tools for corpus exploration, maintenance and conversion. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 168–175.
  23. Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan and Claypool.
  24. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
  25. The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In proceedings of the 27th international conference on computational linguistics: system demonstrations, pages 5–9.
  26. How does the granularity of an annotation scheme influence dependency parsing performance? In Proceedings of COLING 2012: Posters, pages 839–852, Mumbai, India. The COLING 2012 Organizing Committee.
  27. Can multilingual language models transfer to an unseen dialect? a case study on north african arabizi. arXiv preprint arXiv:2005.00318.
  28. Shuyo Nakatani. 2010. Language detection library for java.
  29. Hiroki Nakayama. 2018. seqeval: A python framework for sequence labeling evaluation. Software available from https://github.com/chakki-works/seqeval.
  30. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.
  31. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
  32. tweeDe – a Universal Dependencies treebank for German tweets. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), pages 100–108, Paris, France. Association for Computational Linguistics.
  33. Can character-based language models improve downstream task performances in low-resource and noisy language scenarios? In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 423–436, Online. Association for Computational Linguistics.
  34. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Edinburgh, Scotland, UK. Association for Computational Linguistics.
  35. The Hebrew Universal Dependency treebank: Past present and future. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 133–143, Brussels, Belgium. Association for Computational Linguistics.
  36. Annotation référentielle du corpus arboré de paris 7 en entités nommées. In Traitement Automatique des Langues Naturelles (TALN), volume 2.
  37. Treebanking user-generated content: A proposal for a unified representation in Universal Dependencies. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5240–5250, Marseille, France. European Language Resources Association.
  38. Treebanking user-generated content: a ud based overview of guidelines, corpora and unified recommendations. Language Resources and Evaluation, pages 1–52.
  39. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  40. Natalie Schluter and Josef van Genabith. 2007. Preparing, restructuring, and augmenting a french treebank: Lexicalised parsers or coherent treebanks?
  41. Building a user-generated content North-African Arabizi treebank: Tackling hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150, Online. Association for Computational Linguistics.
  42. The French Social Media Bank: a treebank of noisy user generated content. In Proceedings of COLING 2012, pages 2441–2458, Mumbai, India. The COLING 2012 Organizing Committee.
  43. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4290–4297, Portorož, Slovenia. European Language Resources Association (ELRA).
  44. Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).
  45. Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 173–179, Bergen, Norway. Association for Computational Linguistics.
  46. Samia Touileb. 2022. Nerdz: A preliminary dataset of named entities for algerian. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, pages 95–101.
  47. Samia Touileb and Jeremy Barnes. 2021. The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3700–3712, Online. Association for Computational Linguistics.
  48. Turkish treebanking: Unifying and constructing efforts. In Proceedings of the 13th Linguistic Annotation Workshop, pages 166–177, Florence, Italy. Association for Computational Linguistics.
  49. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470.
  50. Guillaume Wisniewski. 2018. Errator: a tool to help detect annotation errors in the universal dependencies project. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  51. Language resources for maghrebi arabic dialects’ nlp: a survey. Language Resources and Evaluation, 54(4):1079–1142.
  52. Building an endangered language resource in the classroom: Universal Dependencies for kakataibo. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3840–3851, Marseille, France. European Language Resources Association.
Authors (3)
  1. Arij Riabi (9 papers)
  2. Menel Mahamdi (3 papers)
  3. Djamé Seddah (28 papers)