CebuaNER: A New Baseline Cebuano Named Entity Recognition Model (2310.00679v1)

Published 1 Oct 2023 in cs.CL

Abstract: Despite being one of the most linguistically diverse groups of countries, computational linguistics and language processing research in Southeast Asia has struggled to match the level of countries from the Global North. Thus, initiatives such as open-sourcing corpora and the development of baseline models for basic language processing tasks are important stepping stones to encourage the growth of research efforts in the field. To answer this call, we introduce CebuaNER, a new baseline model for named entity recognition (NER) in the Cebuano language. Cebuano is the second most-used native language in the Philippines, with over 20 million speakers. To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language, retrieved from online local Cebuano platforms to train algorithms such as Conditional Random Field and Bidirectional LSTM. Our findings show promising results as a new baseline model, achieving over 70% performance on precision, recall, and F1 across all entity tags, as well as potential efficacy in a crosslingual setup with Tagalog.
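
The abstract reports precision, recall, and F1 for each entity tag. As a rough illustration of how such per-tag scores can be computed, here is a minimal sketch. It assumes token-level scoring and uses a hypothetical tag set (`PER`, `LOC`, `ORG`, with `O` for non-entities) and made-up example sequences. The paper's actual tag set and evaluation granularity are not stated here, so treat the function name `per_tag_prf` and every value below as illustrative only.

```python
from collections import Counter


def per_tag_prf(gold, pred, ignore=("O",)):
    """Token-level precision, recall, and F1 per tag (illustrative helper).

    gold and pred are equal-length lists of tag strings. Tags listed in
    `ignore` (here the hypothetical non-entity tag "O") are not scored.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp[g] += 1  # correct entity tag
        else:
            if p not in ignore:
                fp[p] += 1  # predicted a tag that is not in the gold label
            if g not in ignore:
                fn[g] += 1  # missed the gold tag

    scores = {}
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Made-up sequences, not data from the paper.
gold = ["PER", "O", "LOC", "LOC", "O", "ORG"]
pred = ["PER", "O", "LOC", "O", "ORG", "ORG"]
scores = per_tag_prf(gold, pred)

for tag, (p, r, f) in scores.items():
    print(f"{tag}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Span-level scoring, where an entity counts as correct only if all of its tokens and its boundaries match, is a common alternative and generally gives lower numbers than this token-level sketch.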

