
Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction (2402.14521v1)

Published 22 Feb 2024 in cs.CL

Abstract: Standard English and Malaysian English exhibit notable differences, posing challenges for NLP tasks on Malaysian English. Unfortunately, most existing datasets are based on standard English and are therefore inadequate for improving NLP tasks in Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian English news articles shows that they cannot handle morphosyntactic variations in Malaysian English. To the best of our knowledge, there is no annotated dataset available to improve the model. To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that a dataset tailor-made for Malaysian English could significantly improve NER performance in Malaysian English. This paper presents our effort in data acquisition, the annotation methodology, and a thorough analysis of the annotated dataset. To validate the quality of the annotation, inter-annotator agreement was used, followed by adjudication of disagreements by a subject matter expert. Upon completion of these tasks, we developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze the NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction. The dataset and annotation guideline have been published on GitHub.
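
The abstract mentions validating annotation quality with inter-annotator agreement but does not state the metric here. As a minimal sketch, assuming agreement is measured as exact-match F1 between two annotators' entity spans (one common choice for NER annotation), the computation could look like the following. The `Span` layout, the function name `pairwise_f1`, and the sample spans and labels are hypothetical and are not taken from the MEN dataset.

```python
from typing import Set, Tuple

# Hypothetical span layout: (start_char, end_char, label)
Span = Tuple[int, int, str]


def pairwise_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """Exact-match F1 between two annotators, treating annotator A as the reference."""
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing: full agreement
    matches = len(annotator_a & annotator_b)  # same offsets and same label
    if matches == 0:
        return 0.0
    precision = matches / len(annotator_b)
    recall = matches / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations for one article (made-up offsets and labels).
a = {(0, 12, "PERSON"), (20, 32, "ORGANIZATION"), (40, 52, "LOCATION")}
b = {(0, 12, "PERSON"), (20, 32, "LOCATION"), (40, 52, "LOCATION")}

score = pairwise_f1(a, b)
print(f"Pairwise F1 agreement: {score:.3f}")
```

In this sketch the two annotators agree on two of three spans, and the span they label differently is the kind of disagreement that would go to the subject matter expert for adjudication.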

Authors (4)
  1. Mohan Raj Chanthran (3 papers)
  2. Lay-Ki Soon (15 papers)
  3. Huey Fang Ong (3 papers)
  4. Bhawani Selvaretnam (7 papers)
