Text Categorization Can Enhance Domain-Agnostic Stopword Extraction (2401.13398v1)

Published 24 Jan 2024 in cs.CL and cs.LG

Abstract: This paper investigates the role of text categorization in streamlining stopword extraction in NLP, specifically focusing on nine African languages alongside French. By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages. Nevertheless, linguistic variances result in lower detection rates for certain languages. Interestingly, we find that while over 40% of stopwords are common across news categories, less than 15% are unique to a single category. Uncommon stopwords add depth to text but their classification as stopwords depends on context. Therefore combining statistical and linguistic approaches creates comprehensive stopword lists, highlighting the value of our hybrid method. This research enhances NLP for African languages and underscores the importance of text categorization in stopword extraction.

References (27)
  1. Ali A. Abdi. 2009. Oral societies and colonial experiences: Sub-Saharan Africa and the de-facto power of the written word. In Education, Decolonization and Development, pages 39–56. BRILL.
  2. MasakhaNEWS: News Topic Classification for African languages.
  3. Toluwase Victor Asubiaro. 2013. Entropy-based generic stopwords list for Yoruba texts. International Journal of Computer and Information Technology, 2(5).
  4. Olusanmi Babarinde. 2014. Linguistic analysis of the structure of Yoruba numerals. Language Matters, 45(1):127–147.
  5. MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 10883–10900. Association for Computational Linguistics.
  6. Khalifa Chekima and Rayner Alfred. 2016. An Automatic Construction of Malay Stop Words Based on Aggregation Method. In Communications in Computer and Information Science, pages 180–189. Springer Singapore.
  7. Ljiljana Dolamic and Jacques Savoy. 2009. When stopword lists make the difference. Journal of the American Society for Information Science and Technology, 61(1):200–203.
  8. The African Stopwords project: curating stopwords for African languages.
  9. Automatic learning of linguistic resources for stopword removal and stemming from text. Procedia Computer Science, 38:116–123.
  10. Grammar Error Detection Tool for Medical Transcription Using Stop Words Parts-of-Speech Tags Ngram Based Model. In Proceedings of the Second International Conference on Computational Intelligence and Informatics, pages 37–49. Springer Singapore.
  11. A universal information theoretic approach to the identification of stopwords. Nature Machine Intelligence, 1(12):606–612.
  12. Kripabandhu Ghosh and Arnab Bhattacharya. 2017. Stopword removal: Why bother? A case study on verbose queries. In Proceedings of the 10th Annual ACM India Compute Conference, pages 99–102.
  13. Stop words detection using a long short term memory recurrent neural network. In Proceedings of the 2021 9th International Conference on Information Technology: IoT and Smart City, pages 199–202.
  14. Kristín M. Jóhannsdóttir. 2007. Temporal adverbs in Icelandic: adverbs of quantification vs. frequency adverbs. Nordic Journal of Linguistics, 30(2):157–183.
  15. Francis Katamba. 1984. A nonlinear analysis of vowel harmony in Luganda. Journal of Linguistics, 20(2):257–275.
  16. Edward L. Keenan and Jonathan Stavi. 1986. A semantic characterization of natural language determiners. Linguistics and Philosophy, 9(3):253–326.
  17. Dhara J. Ladani and Nikita P. Desai. 2020. Stopword Identification and Removal Techniques on TC and IR applications: A Survey. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS). IEEE.
  18. Sileshi Girmaw Miretie and Vijayshri Khedkar. 2018. Automatic generation of stopwords in the Amharic text. International Journal of Computer Applications, 975:8887.
  19. KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  20. Toward an effective Igbo part-of-speech tagger. ACM Transactions on Asian and Low-Resource Language Information Processing, 18(4):1–26.
  21. Masakhane - Machine Translation For Africa. arXiv preprint arXiv: 2003.11529.
  22. Rachel Panckhurst. 2009. Texting in three European languages: does the linguistic typology differ? In i-Mean 2009 Issues in Meaning in Interaction, pages 119–136, Bristol, United Kingdom.
  23. Understanding the Behaviors of BERT in Ranking.
  24. Ruby Rani and D.K. Lobiyal. 2018. Automatic Construction of Generic Stop Words List for Hindi Text. Procedia Computer Science, 132:362–370.
  25. Serhad Sarica and Jianxi Luo. 2021. Stopwords in technical language processing. PLOS ONE, 16(8):e0254937.
  26. Anne M. Treisman. 1964. Verbal cues, language, and meaning in selective attention. The American Journal of Psychology, 77(2):206.
  27. Amharic Adhoc Information Retrieval System Based on Morphological Features. Applied Sciences, 12(3):1294.

Summary

  • The paper demonstrates that combining text categorization with hybrid linguistic and statistical methods achieves over 80% stopword detection success in multiple languages.
  • The methodology leverages large datasets from African languages and French to address challenges in low-resource, morphologically complex languages.
  • The findings highlight the contextual variability of stopwords across domains, suggesting avenues for future cross-disciplinary NLP research.

Introduction

Efficient NLP relies in part on the ability to omit semantically insignificant words, commonly known as stopwords. In text analysis, their exclusion allows models to focus on content that carries meaning and importance. This paper explores stopword extraction in a multilingual context encompassing nine African languages alongside French, leveraging large datasets such as MasakhaNEWS, the African Stopwords Project, and MasakhaPOS to assess the potential of text categorization for enhancing domain-agnostic stopword identification.

Methodology

Stopword extraction has traditionally relied on linguistic and statistical methods: linguistic techniques draw on predefined, curated lists, whereas statistical methods identify stopwords automatically from word frequencies and distributional patterns. With advances in deep learning, context-aware extraction techniques have also emerged. For linguistically diverse and low-resource languages, where comprehensive language-specific stopword lists are scarce, combining these approaches can prove more effective. The paper employs such a hybrid methodology to address the linguistic intricacies of African languages, which have historically sat on the periphery of NLP research.
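
As a minimal illustration of such a hybrid approach, the sketch below intersects a frequency-based candidate set with a POS-based function-word filter. It is an assumption-laden sketch, not the authors' pipeline: the corpus and pos_tagged inputs, the top_fraction cutoff, and the use of Universal POS tags (as in MasakhaPOS) are illustrative choices.

    from collections import Counter

    # Closed-class (function word) tags in the Universal POS tag set;
    # an assumption for illustration, not the paper's exact criterion.
    FUNCTION_TAGS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}

    def statistical_candidates(corpus, top_fraction=0.01):
        """Rank tokens by corpus frequency and keep the most frequent slice.
        `corpus` is assumed to be a list of tokenized documents."""
        counts = Counter(tok.lower() for doc in corpus for tok in doc)
        cutoff = max(1, int(len(counts) * top_fraction))
        return {tok for tok, _ in counts.most_common(cutoff)}

    def linguistic_candidates(pos_tagged):
        """Keep tokens whose POS tag marks a closed-class word.
        `pos_tagged` is assumed to be an iterable of (token, tag) pairs."""
        return {tok.lower() for tok, tag in pos_tagged if tag in FUNCTION_TAGS}

    def hybrid_stopwords(corpus, pos_tagged, top_fraction=0.01):
        """Intersect both candidate sets, trading recall for precision."""
        return statistical_candidates(corpus, top_fraction) & linguistic_candidates(pos_tagged)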

Analysis

Utilizing datasets representative of diverse language families, the researchers analyzed how stopwords distribute across news categories such as sports, business, and politics, quantifying the share of stopwords common to all categories and the share unique to a single one. Intriguingly, while a significant portion of stopwords is shared across categories, the analysis also surfaces category-specific stopwords that add lexical richness to text. This nuanced finding underscores the contextual dependence of stopword classification.
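
The category-level overlap figures can be reproduced in spirit with a short helper that, given per-category stopword sets, reports the share of words common to every category and the share unique to exactly one. This is a sketch under assumptions; the category names and word sets in the usage example are toy placeholders, not the paper's data.

    def category_overlap(stopwords_by_category):
        """stopwords_by_category maps a news category to its extracted stopword set."""
        all_words = set().union(*stopwords_by_category.values())
        sets = list(stopwords_by_category.values())
        in_all = {w for w in all_words if all(w in s for s in sets)}
        in_one = {w for w in all_words if sum(w in s for s in sets) == 1}
        return len(in_all) / len(all_words), len(in_one) / len(all_words)

    # Toy usage with invented word sets:
    shared, unique = category_overlap({
        "sports":   {"the", "and", "of", "match"},
        "business": {"the", "and", "of", "market"},
        "politics": {"the", "and", "of", "vote"},
    })
    print(f"common to all categories: {shared:.0%}, unique to one: {unique:.0%}")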

Findings and Implications

The paper reports compelling outcomes, noting detection success rates above 80% for most languages while acknowledging variances due to linguistic differences. African languages pose distinct challenges, in particular because of agglutinative traits, a phenomenon that warrants further exploration. The analysis shows that stopwords vary notably not only between languages but also across domains; as a result, the relevance of a given stopword can fluctuate, challenging the very definition of what constitutes one. The authors propose future studies that expand language coverage, refine hybrid extraction techniques, and embrace the diversity of morphologically complex languages.
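
A detection success rate of this kind can be read as the fraction of a curated reference list that the extraction pipeline recovers, for example against the African Stopwords Project list for a given language. The helper below is a hedged sketch of that comparison; the word lists in the usage example are invented for illustration.

    def detection_rate(extracted, reference):
        """Share of the curated reference list recovered by the extractor."""
        extracted = {w.lower() for w in extracted}
        reference = {w.lower() for w in reference}
        return len(reference & extracted) / len(reference) if reference else 0.0

    # Illustrative call with made-up word lists (not the paper's data):
    rate = detection_rate(extracted={"na", "ti", "ni", "ibi"},
                          reference={"na", "ti", "ni", "pe", "si"})
    print(f"detection success rate: {rate:.0%}")  # -> 60%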

Conclusion

In sum, this investigation enriches the field of NLP by offering a pathway towards creating extensive and nuanced stopword lists, with a special focus on bridging the resource gap for African languages. It further delineates the integral role of text categorization in refining the stopword extraction process and calls for a multi-faceted approach that combines statistical relevance, linguistic understanding, and contextual analysis. The findings encourage cross-disciplinary collaboration among computational linguists, NLP practitioners, and native speakers, each lending their expertise to the collective goal of advancing language technology in a culturally and linguistically diverse world.
