- The paper introduces a TF-IDF method combined with human evaluation to curate a 400-word Marathi stopword list.
- The methodology uses the MahaCorpus, splitting its 24.8 million sentences into 20 subsets to keep the TF-IDF computation manageable.
- Experiments show improved sentiment analysis accuracy with MahaBERT and only a minimal accuracy decrease for IndicBERT and MahaBERT on news classification.
Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval
This paper explores stopword curation for the Marathi language using a TF-IDF (Term Frequency-Inverse Document Frequency) approach combined with human evaluation. The authors aim to narrow the gap in computational resources for low-resource languages: unlike high-resource languages such as English, Marathi lacks a comprehensive stopword list.
Methodology and Approach
The authors use the MahaCorpus, a dataset of 24.8 million Marathi sentences, to systematically curate a stopword list. The corpus is segmented into 20 subsets, which keeps the computation manageable while preserving content diversity. TF-IDF scores are computed on each subset, and the terms with the lowest scores, that is, words that appear frequently but carry little informational value, are taken as stopword candidates.
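The candidate-selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, treats each sentence as one document, scores a term by its mean TF-IDF over the sentences that contain it, and takes the union of the lowest-scoring terms from each subset. The function names and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def split_into_subsets(sentences, n_subsets):
    """Split sentences into contiguous chunks of near-equal size."""
    size = max(1, math.ceil(len(sentences) / n_subsets))
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def tfidf_scores(sentences):
    """Mean TF-IDF per term, treating each sentence as a document (assumption)."""
    tokenized = [s.split() for s in sentences]  # whitespace tokenization (assumption)
    n_docs = len(tokenized)

    # Document frequency: number of sentences containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    totals = defaultdict(float)
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            totals[term] += tf * idf

    # Average over the sentences in which the term occurs.
    return {term: totals[term] / doc_freq[term] for term in totals}


def candidate_stopwords(sentences, n_subsets, top_k):
    """Union of the top_k lowest-scoring terms from every subset."""
    candidates = set()
    for subset in split_into_subsets(sentences, n_subsets):
        scores = tfidf_scores(subset)
        # Sort by score, then by term, so ties are broken deterministically.
        ranked = sorted(scores, key=lambda term: (scores[term], term))
        candidates.update(ranked[:top_k])
    return candidates


toy_corpus = [
    "तो शाळेत आहे",
    "ती घरी आहे",
    "पुस्तक टेबलावर आहे",
    "आज पाऊस आहे",
]
print(candidate_stopwords(toy_corpus, n_subsets=2, top_k=1))
```

In the toy corpus the word "आहे" occurs in every sentence, so its inverse document frequency is zero and it ranks lowest in each subset. On a real corpus the aggregation rule and cutoff would need tuning, which is where the human review described next comes in.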
A critical post-processing step is human evaluation: three native Marathi speakers review a list of 2297 candidate stopwords and, by majority vote, reduce it to 400 words. This step ensures that the final list reflects how the words actually function in Marathi rather than frequency statistics alone. The curated list is publicly available through the mahaNLP library.
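The majority-voting step can be illustrated with a short sketch. The data layout here (one dictionary of yes/no judgments per annotator) and the function name are assumptions made for illustration; the summary does not describe how the annotations were recorded.

```python
def majority_vote(candidates, annotator_votes):
    """Keep candidates that more than half of the annotators accept.

    annotator_votes is a list with one dict per annotator, mapping
    word -> True (stopword) or False (not a stopword). A word missing
    from an annotator's dict counts as a "no" (assumption).
    """
    n_annotators = len(annotator_votes)
    kept = []
    for word in candidates:
        yes_votes = sum(1 for votes in annotator_votes if votes.get(word, False))
        if yes_votes * 2 > n_annotators:  # strict majority
            kept.append(word)
    return kept


candidates = ["आणि", "आहे", "शाळा"]
votes = [
    {"आणि": True, "आहे": True, "शाळा": False},
    {"आणि": True, "आहे": False, "शाळा": False},
    {"आणि": True, "आहे": True, "शाळा": True},
]
print(majority_vote(candidates, votes))
```

With three annotators, a word is kept when at least two of them accept it, so ties cannot occur.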
Experimental Results and Impact
The paper evaluates the stopword list on text classification using the L3Cube-MahaNews dataset. Stopword removal is tested with two pre-trained transformer models, IndicBERT and MahaBERT, and has only a marginal effect on accuracy: both models show a minimal decrease, which suggests that they are robust to stopword removal and that removing stopwords costs little on this task.
On sentiment analysis with the MahaSENT dataset, MahaBERT's accuracy improves when stopwords are removed. This suggests that stopword removal can benefit sentiment analysis and that the curated list is useful in practical applications. Taken together, the two experiments indicate that the effect of stopword removal varies across NLP tasks.
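As a rough illustration of the preprocessing involved, stopword removal before classification might look like the sketch below. It assumes whitespace tokenization and uses a two-word placeholder set instead of the actual 400-word list; it does not use the mahaNLP API, and the function name is hypothetical.

```python
def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in the stopword set."""
    return " ".join(token for token in text.split() if token not in stopwords)


# Placeholder set for illustration only, not the curated list.
sample_stopwords = {"आहे", "आणि"}

cleaned = remove_stopwords("तो शाळेत आहे आणि ती घरी आहे", sample_stopwords)
print(cleaned)
```

The cleaned text would then be passed to the model's tokenizer in place of the raw sentence.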
Implications and Future Directions
This paper is a foundational effort in curating stopwords for Marathi and provides a resource for both Marathi NLP research and practical applications. The 400-word list, selected by TF-IDF and validated by native speakers, supports text analysis and information retrieval in a low-resource language setting.
Moreover, this work underscores the importance of language-specific resources in NLP, drawing attention to the unique challenges and opportunities associated with low-resource languages. Future research could expand on this foundation by exploring domain-specific stopword lists or integrating the curated list into other NLP tasks in Marathi, such as machine translation or entity recognition.
In conclusion, the authors present a pragmatic and linguistically informed approach to stopword curation in Marathi. The work addresses a significant gap in Marathi language resources, and its method may carry over to other low-resource languages. Combining automatic TF-IDF filtering with human judgment yields a stopword list that is both computationally derived and linguistically validated.