Text Categorization Can Enhance Domain-Agnostic Stopword Extraction (2401.13398v1)
Abstract: This paper investigates how text categorization can streamline stopword extraction in NLP, focusing on nine African languages alongside French. Using the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, we find that text categorization effectively identifies domain-agnostic stopwords, with a detection success rate of over 80% for most of the examined languages. Linguistic variation nevertheless results in lower detection rates for certain languages. Interestingly, while over 40% of stopwords are common across news categories, less than 15% are unique to a single category. Uncommon stopwords add depth to text, but whether they count as stopwords depends on context. Combining statistical and linguistic approaches therefore yields more comprehensive stopword lists, highlighting the value of our hybrid method. This research enhances NLP for African languages and underscores the importance of text categorization in stopword extraction.
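The idea of separating stopwords shared across news categories from those unique to one category can be illustrated with a minimal sketch. This is not the paper's pipeline: it assumes whitespace tokenization, a simple top-k frequency cutoff as the statistical criterion, and a toy English corpus. The function names `top_k_words` and `split_by_category` and the parameter `k` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def top_k_words(docs, k):
    """Return the k most frequent lowercase tokens across docs as a set.

    Assumption: plain whitespace tokenization; a real system would use a
    language-aware tokenizer.
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return {word for word, _ in counts.most_common(k)}


def split_by_category(corpus, k):
    """Split stopword candidates into shared and category-unique sets.

    corpus maps a category label to a list of documents. Returns a pair
    (shared, unique): candidates found in every category, and for each
    category the candidates that appear in no other category.
    """
    per_cat = {cat: top_k_words(docs, k) for cat, docs in corpus.items()}
    shared = set.intersection(*per_cat.values())
    unique = {}
    for cat, words in per_cat.items():
        others = set().union(*(w for c, w in per_cat.items() if c != cat))
        unique[cat] = words - others
    return shared, unique


# Toy corpus (hypothetical data, for illustration only).
corpus = {
    "sports": ["the team won the match", "the coach of the team"],
    "health": ["the doctor of the clinic", "the clinic and the nurse"],
}
shared, unique = split_by_category(corpus, k=2)
print(shared)  # candidates common to all categories
print(unique)  # candidates specific to a single category
```

In this sketch, candidates in `shared` play the role of domain-agnostic stopwords, while the entries of `unique` are frequent only within one category and would need linguistic review before being treated as stopwords.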
- Ali A. Abdi. 2009. Oral societies and colonial experiences: Sub-Saharan Africa and the de-facto power of the written word. In Education, Decolonization and Development, pages 39–56. BRILL.
- MasakhaNEWS: News Topic Classification for African languages.
- Toluwase Victor Asubiaro. 2013. Entropy-based generic stopwords list for Yoruba texts. International Journal of Computer and Information Technology, 2(5).
- Olusanmi Babarinde. 2014. Linguistic analysis of the structure of Yoruba numerals. Language Matters, 45(1):127–147.
- MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10883–10900. Association for Computational Linguistics.
- Khalifa Chekima and Rayner Alfred. 2016. An Automatic Construction of Malay Stop Words Based on Aggregation Method. In Communications in Computer and Information Science, pages 180–189. Springer Singapore.
- Ljiljana Dolamic and Jacques Savoy. 2009. When stopword lists make the difference. Journal of the American Society for Information Science and Technology, 61(1):200–203.
- The African Stopwords project: curating stopwords for African languages.
- Automatic learning of linguistic resources for stopword removal and stemming from text. Procedia Computer Science, 38:116–123.
- Grammar Error Detection Tool for Medical Transcription Using Stop Words Parts-of-Speech Tags Ngram Based Model. In Proceedings of the Second International Conference on Computational Intelligence and Informatics, pages 37–49. Springer Singapore.
- A universal information theoretic approach to the identification of stopwords. Nature Machine Intelligence, 1(12):606–612.
- Kripabandhu Ghosh and Arnab Bhattacharya. 2017. Stopword removal: Why bother? A case study on verbose queries. In Proceedings of the 10th Annual ACM India Compute Conference, pages 99–102.
- Stop words detection using a long short term memory recurrent neural network. In Proceedings of the 2021 9th International Conference on Information Technology: IoT and Smart City, pages 199–202.
- Kristín M. Jóhannsdóttir. 2007. Temporal adverbs in Icelandic: adverbs of quantification vs. frequency adverbs. Nordic Journal of Linguistics, 30(2):157–183.
- Francis Katamba. 1984. A nonlinear analysis of vowel harmony in Luganda. Journal of Linguistics, 20(2):257–275.
- Edward L. Keenan and Jonathan Stavi. 1986. A semantic characterization of natural language determiners. Linguistics and Philosophy, 9(3):253–326.
- Dhara J. Ladani and Nikita P. Desai. 2020. Stopword Identification and Removal Techniques on TC and IR applications: A Survey. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS). IEEE.
- Sileshi Girmaw Miretie and Vijayshri Khedkar. 2018. Automatic generation of stopwords in the Amharic text. International Journal of Computer Applications, 975:8887.
- KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Toward an effective Igbo part-of-speech tagger. ACM Transactions on Asian and Low-Resource Language Information Processing, 18(4):1–26.
- Masakhane - Machine Translation For Africa. arXiv preprint arXiv:2003.11529.
- Rachel Panckhurst. 2009. Texting in three European languages: does the linguistic typology differ? In i-Mean 2009 Issues in Meaning in Interaction, pages 119–136, Bristol, United Kingdom.
- Understanding the Behaviors of BERT in Ranking.
- Ruby Rani and D.K. Lobiyal. 2018. Automatic Construction of Generic Stop Words List for Hindi Text. Procedia Computer Science, 132:362–370.
- Serhad Sarica and Jianxi Luo. 2021. Stopwords in technical language processing. PLOS ONE, 16(8):e0254937.
- Anne M. Treisman. 1964. Verbal cues, language, and meaning in selective attention. The American Journal of Psychology, 77(2):206.
- Amharic Adhoc Information Retrieval System Based on Morphological Features. Applied Sciences, 12(3):1294.