
L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models (2401.00170v1)

Published 30 Dec 2023 in cs.CL and cs.LG

Abstract: This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in Marathi informal text. The L3Cube-MahaSocialNER dataset offers user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set, highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP

Authors (5)
  1. Harsh Chaudhari
  2. Anuja Patil
  3. Dhanashree Lavekar
  4. Pranav Khairnar
  5. Raviraj Joshi

Summary

Insights into L3Cube-MahaSocialNER: A Social Media-Based Marathi NER Dataset and BERT Models

The paper introduces a significant advancement in Named Entity Recognition (NER) for low-resource languages, specifically Marathi, through the creation of the L3Cube-MahaSocialNER dataset. This dataset stands out as the largest social media-specific NER dataset for the Marathi language, comprising 18,000 manually labeled sentences. Its development addresses the linguistic and processing challenges posed by the idiosyncrasies of social media data, including non-standard language usage, slang, and informal idioms.

Dataset and Methodology

The L3Cube-MahaSocialNER dataset has been carefully annotated with eight distinct entity classes, employing both IOB and non-IOB tagging schemes. Its annotations are consistent with existing non-social Marathi NER datasets to facilitate cross-domain analysis. This systematic annotation supports the evaluation and training of various deep learning models such as CNN, LSTM, BiLSTM, and several BERT-based transformers, including both monolingual and multilingual variants.
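To make the two notations concrete, below is a minimal, illustrative Python sketch of how a tokenized Marathi sentence might be labeled under each scheme; the tag names are generic placeholders rather than the dataset's exact label inventory.

```python
# Illustrative IOB vs. non-IOB labeling of a tokenized Marathi sentence.
# The tag names (PER, LOC) are generic placeholders, not necessarily the
# exact MahaSocialNER label inventory.

tokens = ["राहुल", "पुण्यात", "राहतो"]  # roughly: "Rahul lives in Pune"

# IOB notation: B- opens an entity span, I- continues it, O marks non-entities.
iob_tags = ["B-PER", "B-LOC", "O"]

# Non-IOB notation: tokens inside an entity carry the bare class label.
non_iob_tags = ["PER", "LOC", "O"]

for token, iob, flat in zip(tokens, iob_tags, non_iob_tags):
    print(f"{token}\t{iob}\t{flat}")
```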

The methodology section explores numerous model architectures, notably highlighting the efficacy of BERT and its variants. A key finding is that fine-tuning existing NER models on domain-specific datasets like MahaSocialNER delivers superior performance: models adapted to the social media domain handled informal and non-standard text markedly better than generic NER models.
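As a rough sketch of what such fine-tuning looks like in practice, the snippet below uses the Hugging Face Transformers token-classification API. The checkpoint id, label count, and hyperparameters are assumptions for illustration, not values taken from the paper; the released checkpoints are listed in the MarathiNLP repository.

```python
# A minimal fine-tuning sketch using Hugging Face Transformers.
# The checkpoint id, label count, and hyperparameters are assumptions;
# see https://github.com/l3cube-pune/MarathiNLP for the released models.

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)


def fine_tune(train_ds, eval_ds,
              model_name="l3cube-pune/marathi-bert-v2",  # assumed MahaBERT id
              num_labels=17):  # 8 entity classes as B-/I- pairs, plus "O"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_labels)

    args = TrainingArguments(
        output_dir="mahasocialner-bert",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )

    # train_ds / eval_ds: tokenized splits with word-aligned label ids,
    # prepared from the MahaSocialNER data (preprocessing omitted here).
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return model, tokenizer
```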

Results and Observations

The experimental results reveal that model performance differs significantly between the social media dataset and traditional non-social datasets. Notably, the MahaNER-BERT model, fine-tuned on the L3Cube-MahaSocialNER dataset, achieved the highest F1 score, indicating its strong suitability for recognizing Marathi entities in social contexts. This underscores the value of transfer learning, in which existing models are fine-tuned on domain-specific datasets to improve applicability and performance.
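For context, NER F1 scores of this kind are conventionally computed at the entity-span level rather than per token. A small sketch with the seqeval library (toy label sequences, not outputs from the paper) shows how a truncated span counts as an error:

```python
# Entity-level F1 as conventionally reported for NER, computed with seqeval.
# The label sequences below are toy examples, not outputs from the paper.

from seqeval.metrics import classification_report, f1_score

gold = [["B-PER", "O", "B-LOC", "I-LOC", "O"]]
pred = [["B-PER", "O", "B-LOC", "O", "O"]]  # LOC span cut short: one error

print(f1_score(gold, pred))  # 0.5, since one of the two entity spans matches
print(classification_report(gold, pred))
```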

Zero-shot experiments conducted using regular NER models on the social media dataset yielded relatively low F1 scores, accentuating the need for specialized datasets in diverse contexts. The highest performances in both the IOB and non-IOB schemes were attributed to models that combined pre-trained knowledge with targeted training on the new dataset, showcasing the robustness and adaptability of BERT-based architectures when fine-tuned properly.
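A zero-shot check of this kind could be reproduced along the following lines. The model id below is an assumption (consult the MarathiNLP repository for the actual released NER checkpoint), and the input is a made-up, code-mixed example of informal Marathi.

```python
# Zero-shot sanity check: apply a regular (non-social) Marathi NER model to
# informal social-media text without any fine-tuning. The model id is an
# assumption; see https://github.com/l3cube-pune/MarathiNLP for checkpoints.

from transformers import pipeline

ner = pipeline("token-classification",
               model="l3cube-pune/marathi-ner",   # assumed MahaNER-BERT id
               aggregation_strategy="simple")     # merge subwords into spans

# Code-mixed, informal Marathi input typical of social media.
print(ner("पुण्यात आज जाम traffic आहे यार"))
```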

Implications and Future Directions

The introduction of the L3Cube-MahaSocialNER dataset is a noteworthy contribution to the field of NER, especially for languages that are not as well-resourced as English. It facilitates advancements in real-time applications such as sentiment analysis, public opinion tracking, and trend analysis on platforms where Marathi is the primary medium of communication. The dataset and models also hold potential for integrating NER capabilities into applications aimed at user-centric information extraction, essential for personalized marketing and news aggregation services.

The research opens avenues for enhancing NER systems for other low-resource languages. By establishing a framework for creating domain-specific datasets and applying state-of-the-art model architectures, the work paves the way for subsequent studies to apply similar methodologies to additional languages and domains. As the NLP field continues to evolve, future work could explore advanced training techniques and leverage cross-lingual resources to strengthen the transfer learning process further. Overall, this research sets a precedent for subsequent efforts in developing and refining language technologies for social media and informal text processing.