My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks (2306.14030v2)

Published 24 Jun 2023 in cs.CL and cs.LG

Abstract: The research on code-mixed data is limited due to the unavailability of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets MeHate, MeSent, and MeLID for downstream tasks like code-mixed Mr-En hate speech detection, sentiment analysis, and language identification respectively. These evaluation datasets individually consist of ~12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus significantly outperform the existing state-of-the-art BERT models. This is the first work that presents artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP .

Overview of "My Boli: Code-mixed Marathi-English Corpora, Pretrained LLMs and Evaluation Benchmarks"

The paper "My Boli: Code-mixed Marathi-English Corpora, Pretrained LLMs and Evaluation Benchmarks" addresses the linguistic complexity of code-mixed data, specifically the Marathi-English mixture, by presenting a comprehensive set of resources aimed at fostering NLP capabilities in this domain. The authors introduce several datasets and models tailored for code-mixed Marathi-English NLP tasks, setting a foundation for further research and applications in this evolving field.

Contributions

The paper makes the following key contributions:

  1. Datasets Release: The authors curate and release three supervised datasets (L3Cube-MeHate, MeLID, and MeSent) and one large unsupervised dataset (L3Cube-MeCorpus). The unsupervised corpus comprises 5 million Roman-script code-mixed sentences and a further 5 million sentences transliterated into Devanagari, making it a substantial pre-training corpus.
  2. Pre-trained Language Models: Several BERT-based models specifically adapted to code-mixed Marathi-English text are introduced: MeBERT, MeBERT-Mixed, MeBERT-Mixed-v2, and MeRoBERTa variants. These models were pre-trained on the 10-million-sentence MeCorpus and benefit from both Roman- and Devanagari-script data (see the loading sketch after this list).
  3. Benchmarks for Evaluation: To ensure robust evaluation, the paper presents benchmark datasets for tasks such as sentiment analysis, hate speech detection, and language identification. These datasets have been annotated by native Marathi speakers, ensuring high-quality labels.
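
If the pretrained checkpoints are distributed through the Hugging Face Hub, they can be loaded with the standard transformers API. The sketch below assumes a hub identifier such as l3cube-pune/me-bert-mixed, which is an assumption, not something the paper confirms; the project repository (https://github.com/l3cube-pune/MarathiNLP) lists the actual released models.

```python
# Minimal sketch: masked-token inference with a code-mixed Mr-En BERT.
# The hub identifier below is an ASSUMED name, not confirmed by the paper;
# check https://github.com/l3cube-pune/MarathiNLP for the released IDs.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "l3cube-pune/me-bert-mixed"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# A Romanized code-mixed Marathi-English sentence with one masked token.
text = f"mala he movie khup {tokenizer.mask_token} vatli"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring fill for the masked position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```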

Results and Observations

The experimental results indicate that the novel MeBERT-based models outperform existing state-of-the-art models across all three benchmark tasks. On hate speech detection with the MeHate dataset, MeBERT-Mixed-v2 achieved the highest F1 score of 78.3%. MeRoBERTa led on sentiment analysis with 67.27% on the MeSent dataset, and MeBERT-Mixed-v2 also performed best on language identification with an F1 score of 88.6%. These results show that the newly trained models handle code-mixed data more effectively than established baselines.
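
To make the evaluation setup concrete, the following sketch fine-tunes one of the pretrained checkpoints for binary hate speech classification with the transformers Trainer. The model identifier, CSV file names, and column names are illustrative assumptions; the released MeHate files may be organized differently.

```python
# Illustrative fine-tuning sketch for code-mixed hate speech detection.
# Model ID, file paths, and column names ("text", "label") are ASSUMPTIONS;
# adapt them to the actual MeHate release.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "l3cube-pune/me-bert-mixed"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

data = load_dataset("csv", data_files={"train": "mehate_train.csv",
                                       "test": "mehate_test.csv"})

def tokenize(batch):
    # Truncate long tweets; dynamic padding is handled by the Trainer's collator.
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="mehate-ft",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
print(trainer.evaluate())
```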

Implications and Future Directions

The implications of this work are both practical and theoretical. Practically, the availability of these datasets and models democratizes Marathi-English NLP by providing foundational resources to both academic and industrial researchers. The work also supports content moderation and analysis on social media platforms, which matters given how prevalent code-mixed language is online.

Theoretically, the paper paves the way for further exploration into effective pre-training and fine-tuning methodologies for code-mixed languages. It encourages the development of robust transliteration techniques capable of handling the many variations in spelling inherent in code-mixed language.
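
As a concrete illustration of why transliteration is hard in this setting, the sketch below uses the indic-transliteration package (an external library; the paper does not state which tooling it used). Scheme-based transliteration is deterministic and assumes standardized spellings, which noisy Romanized social media text rarely follows.

```python
# Scheme-based transliteration with indic-transliteration (external library,
# not necessarily what the paper used). ITRANS is one standard Roman scheme.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

# Devanagari -> Roman (ITRANS): deterministic and reliable.
print(transliterate("मला हे आवडले", sanscript.DEVANAGARI, sanscript.ITRANS))

# Roman -> Devanagari only works when the input follows the scheme exactly;
# ad-hoc social media spellings ("mala he aavdla", "mla he avdla") break it,
# which is why learned, spelling-robust transliteration remains an open need.
print(transliterate("malA he AvaDale", sanscript.ITRANS, sanscript.DEVANAGARI))
```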

Future research could build on these resources by covering a broader range of regional dialects and language variations, enlarging the pre-training corpus, and integrating more sophisticated approaches for nuanced tasks like sarcasm detection or contextual emotion recognition in code-mixed text.

In summary, this paper marks a substantial step towards the establishment of comprehensive NLP tools for code-mixed Marathi-English data, providing a valuable platform for subsequent advancements in the field.

Authors (5)
  1. Tanmay Chavan (11 papers)
  2. Omkar Gokhale (6 papers)
  3. Aditya Kane (14 papers)
  4. Shantanu Patankar (8 papers)
  5. Raviraj Joshi (76 papers)