Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval (2406.11029v1)

Published 16 Jun 2024 in cs.CL and cs.LG

Abstract: Stopwords are commonly used words in a language that are often considered to be of little value in determining the meaning or significance of a document. These words occur frequently in most texts and don't provide much useful information for tasks like sentiment analysis and text classification. English, which is a high-resource language, takes advantage of the availability of stopwords, whereas low-resource Indian languages like Marathi are very limited, standardized, and can be used in available packages, but the number of available words in those packages is low. Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences. We make use of the TF-IDF approach coupled with human evaluation to curate a strong stopword list of 400 words. We apply the stop word removal to the text classification task and show its efficacy. The work also presents a simple recipe for stopword curation in a low-resource language. The stopwords are integrated into the mahaNLP library and publicly available on https://github.com/l3cube-pune/MarathiNLP .

Summary

The paper introduces a TF-IDF method combined with human evaluation to curate a 400-word Marathi stopword list.
The method processes the 24.8 million sentences of the MahaCorpus in 20 subsets to keep the computation manageable.
Experiments show improved sentiment analysis accuracy with MahaBERT and only a marginal accuracy decrease for transformer-based text classification.

Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval

This paper addresses stopword curation for the Marathi language, combining a TF-IDF (Term Frequency-Inverse Document Frequency) approach with human evaluation. The authors aim to narrow the gap in computational resources for low-resource languages like Marathi, which, unlike high-resource languages such as English, lacks a comprehensive stopword list.

Methodology and Approach

The authors use the MahaCorpus, a dataset of 24.8 million Marathi sentences, to curate a stopword list. The corpus is split into 20 subsets, which keeps the computation manageable while preserving content diversity. TF-IDF scores are computed on each subset, and the terms with the lowest scores, those that appear frequently across the text but carry little information, are taken as stopword candidates.
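The subset-wise scoring described above can be sketched in plain Python. This is only an illustration under stated assumptions: the exact TF-IDF variant, tokenizer, and aggregation across subsets are not given here, so the sketch treats each sentence as a document, tokenizes on whitespace, uses an unsmoothed IDF of log(N/df), and averages scores across subsets. The function names `mean_tfidf` and `candidate_stopwords` and the toy sentences are hypothetical and not taken from the paper or the MahaCorpus.

```python
import math
from collections import Counter, defaultdict


def mean_tfidf(documents):
    """Mean TF-IDF of each term over the documents that contain it."""
    tokenized = [doc.split() for doc in documents]  # whitespace tokens (assumption)
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    totals = defaultdict(float)
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])  # 0 if the term is in every document
            totals[term] += tf * idf
    return {term: totals[term] / doc_freq[term] for term in totals}


def candidate_stopwords(subsets, top_k):
    """Average each term's score across subsets and return the top_k lowest."""
    scores = defaultdict(list)
    for subset in subsets:
        for term, score in mean_tfidf(subset).items():
            scores[term].append(score)
    averaged = {term: sum(vals) / len(vals) for term, vals in scores.items()}
    return sorted(averaged, key=lambda term: (averaged[term], term))[:top_k]


# Toy subsets (hypothetical sentences, not from the MahaCorpus).
subset_a = ["हे पुस्तक चांगले आहे आणि स्वस्त आहे", "तो मुलगा हुशार आहे आणि शांत आहे"]
subset_b = ["ही नदी मोठी आहे आणि खोल आहे", "ते घर जुने आहे आणि लहान आहे"]

print(candidate_stopwords([subset_a, subset_b], top_k=2))
```

In this toy data, आहे and आणि occur in every sentence, so their IDF and therefore their TF-IDF score is zero, and they rank first as candidates. On a real corpus a smoothed IDF and a proper Marathi tokenizer would likely be needed, and the ranked list is only a starting point for human review.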

In a post-processing step, three native Marathi speakers review a list of 2297 candidate stopwords. Majority voting among the three reviewers reduces this list to 400 words, so that the final list reflects native-speaker judgment as well as corpus statistics. The curated list is integrated into the mahaNLP library as a publicly available resource.
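The voting step can be illustrated with a short sketch. With three reviewers, a majority means at least two agree that a candidate is a stopword. The function name `majority_vote`, the data layout, and the example judgments below are hypothetical; the actual annotation format used by the authors is not described here.

```python
def majority_vote(judgments, min_votes=2):
    """Keep words that at least min_votes annotators marked as stopwords."""
    return [word for word, votes in judgments.items() if sum(votes) >= min_votes]


# Hypothetical judgments: one boolean per annotator for each candidate word.
judgments = {
    "आणि": (True, True, True),
    "आहे": (True, True, False),
    "पुस्तक": (False, False, True),
}

print(majority_vote(judgments))
```

Here आणि and आहे are kept because at least two of the three annotators accepted them, while पुस्तक is dropped.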

Experimental Results and Impact

The authors evaluate the stopword list on text classification using the L3Cube-MahaNews dataset. Stopword removal is tested with two pre-trained transformer models, IndicBERT and MahaBERT. Accuracy decreases only minimally, which indicates that these models are robust to stopword removal and that removing the curated stopwords has little adverse effect on classification.
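As a rough illustration of the preprocessing step, stopword removal before classification can be as simple as filtering tokens against the curated list. The sketch below does not use the mahaNLP API; `remove_stopwords` is a hypothetical helper, tokenization is by whitespace, and the three-word stopword set is a stand-in for the 400-word list.

```python
def remove_stopwords(text, stopwords):
    """Drop whitespace-delimited tokens that appear in the stopword set."""
    return " ".join(token for token in text.split() if token not in stopwords)


# Hypothetical stand-in for the curated 400-word list.
stopwords = {"हे", "आहे", "आणि"}

cleaned = remove_stopwords("हे पुस्तक चांगले आहे आणि स्वस्त आहे", stopwords)
print(cleaned)
```

The filtered text would then be passed to the model's tokenizer as usual. Punctuation attached to a token would prevent a match in this sketch, so real input would likely need punctuation handling first.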

In sentiment analysis on the MahaSENT dataset, accuracy improves when stopwords are removed before applying the MahaBERT model. This suggests that stopword removal can benefit sentiment analysis and points to the practical utility of the curated list. Taken together, the two tasks indicate that the effect of stopword removal varies across NLP applications.

Implications and Future Directions

This paper is a foundational effort in curating stopwords for the Marathi language and provides a resource for Marathi NLP research and practical applications. The 400-word list, validated by native speakers, supports text analysis and information retrieval in a low-resource language context.

Moreover, this work underscores the importance of language-specific resources in NLP, drawing attention to the unique challenges and opportunities associated with low-resource languages. Future research could expand on this foundation by exploring domain-specific stopword lists or integrating the curated list into other NLP tasks in Marathi, such as machine translation or entity recognition.

In conclusion, the authors present a pragmatic and linguistically informed approach to stopword curation in Marathi. The work addresses a gap in language resources and offers a recipe that can be applied to other low-resource languages. Combining automatic scoring with human judgment yields a stopword list that is both computationally derived and linguistically validated.

GitHub

GitHub - l3cube-pune/MarathiNLP: Marathi NLP is a repository dedicated to the development of tools and resources for the Marathi language. (110 stars)