Overview of "My Boli: Code-mixed Marathi-English Corpora, Pretrained LLMs and Evaluation Benchmarks"
The paper "My Boli: Code-mixed Marathi-English Corpora, Pretrained LLMs and Evaluation Benchmarks" addresses the linguistic complexity of code-mixed data, specifically the Marathi-English mixture, by presenting a comprehensive set of resources aimed at fostering NLP capabilities in this domain. The authors introduce several datasets and models tailored for code-mixed Marathi-English NLP tasks, setting a foundation for further research and applications in this evolving field.
Contributions
The paper makes the following key contributions:
- Dataset Release: The authors curate and release three supervised datasets (L3Cube-MeHate, L3Cube-MeLID, and L3Cube-MeSent) and one large unsupervised dataset (L3Cube-MeCorpus). The unsupervised corpus comprises 5 million Roman-script code-mixed sentences plus an additional 5 million sentences transliterated into Devanagari, a substantial resource for pre-training.
- Pre-trained LLMs: Several BERT-based models specifically adapted for code-mixed Marathi-English text are introduced: MeBERT, MeBERT-Mixed, MeBERT-Mixed-v2, and MeRoBERTa variants. All were pre-trained on the 10-million-sentence MeCorpus and therefore benefit from both Roman- and Devanagari-script data (see the loading sketch after this list).
- Benchmarks for Evaluation: To ensure robust evaluation, the paper presents benchmark datasets for tasks such as sentiment analysis, hate speech detection, and language identification. These datasets have been annotated by native Marathi speakers, ensuring high-quality labels.
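As a concrete illustration of how the released models might be used, the sketch below loads one of the BERT variants for a downstream classification task. The Hub ID, label count, and example sentence are assumptions for illustration (L3Cube models are typically published under the l3cube-pune namespace); consult the paper for the exact identifiers.

```python
# Minimal sketch: loading a released code-mixed Marathi-English model for
# sequence classification. The Hub ID "l3cube-pune/me-bert-mixed" is an
# assumption based on the L3Cube naming convention; check the paper's
# repository for the exact identifiers.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "l3cube-pune/me-bert-mixed"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=3 matches a 3-class sentiment setup (negative/neutral/positive);
# the classification head is freshly initialized and needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

# A Roman-script code-mixed Marathi-English sentence (illustrative).
text = "He movie khup chan hote, totally worth watching!"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 3): one logit per sentiment class
```

Because the classification head is newly initialized, the model must still be fine-tuned on one of the supervised datasets before its predictions are meaningful.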
Results and Observations
The experimental results indicate that the novel MeBERT-based models outperform existing state-of-the-art models across the benchmark tasks. On hate speech detection with the MeHate dataset, MeBERT-Mixed-v2 achieved the highest F1 score of 78.3%. MeRoBERTa led sentiment analysis with 67.27% on the MeSent dataset, and MeBERT-Mixed-v2 also showed superior performance on language identification with an F1 score of 88.6%. These results demonstrate that the newly trained models handle code-mixed data more effectively than established baselines.
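To make the evaluation concrete, here is a minimal sketch of how F1 scores of this kind are typically computed over a held-out test split. The macro averaging and the placeholder label arrays are assumptions; the paper's exact protocol may differ.

```python
# Minimal sketch of the kind of F1 evaluation behind the reported numbers.
# The macro averaging and the placeholder predictions are assumptions.
from sklearn.metrics import classification_report, f1_score

# Placeholder gold labels and model predictions for a 3-class task
# (e.g., MeSent sentiment: 0=negative, 1=neutral, 2=positive).
y_true = [0, 2, 1, 2, 0, 1, 2, 2]
y_pred = [0, 2, 1, 1, 0, 1, 2, 0]

print(f"macro F1: {f1_score(y_true, y_pred, average='macro'):.4f}")
print(classification_report(y_true, y_pred, digits=4))
```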
Implications and Future Directions
The implications of this work are both practical and theoretical. Practically, the availability of these datasets and models democratizes Marathi-English NLP by providing foundational resources to academic and industrial researchers alike. The work also supports content moderation and analysis on social media platforms, where code-mixed language is pervasive.
Theoretically, the paper paves the way for further exploration of effective pre-training and fine-tuning methodologies for code-mixed languages. It also motivates robust transliteration techniques that can handle the many spelling variations inherent in romanized code-mixed text; the sketch below illustrates the problem.
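A small sketch makes the transliteration challenge concrete: a deterministic, rule-based scheme maps each Devanagari sentence to exactly one romanization, whereas real social-media text spells the same words in many informal ways. The library, sentence, and variants below are illustrative, not the paper's pipeline.

```python
# Illustrative sketch of rule-based Devanagari-to-Roman transliteration,
# using the indic-transliteration package (pip install indic-transliteration).
# This only shows why spelling variation is hard; it is not the paper's method.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

devanagari = "मी उद्या मुंबईला जाणार आहे"  # "I will go to Mumbai tomorrow"

# A deterministic scheme produces exactly one romanization...
print(transliterate(devanagari, sanscript.DEVANAGARI, sanscript.ITRANS))

# ...but real code-mixed text spells the same words in many informal ways,
# none of which a single rule-based scheme can enumerate:
informal_variants = [
    "mi udya mumbaila janar ahe",
    "mee udyaa mumbai la jaanar aahe",
]
for variant in informal_variants:
    print(variant)
```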
Future research could extend these resources to cover regional dialects and broader language variation, enrich the pre-training corpora, and address more nuanced tasks such as sarcasm detection or contextual emotion recognition in code-mixed text.
In summary, this paper marks a substantial step towards the establishment of comprehensive NLP tools for code-mixed Marathi-English data, providing a valuable platform for subsequent advancements in the field.