
L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models (2401.00170v1)

Published 30 Dec 2023 in cs.CL and cs.LG

Abstract: This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in Marathi informal text. The L3Cube-MahaSocialNER dataset offers user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set, highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP

Authors (5)
  1. Harsh Chaudhari
  2. Anuja Patil
  3. Dhanashree Lavekar
  4. Pranav Khairnar
  5. Raviraj Joshi

Summary

Insights into L3Cube-MahaSocialNER: A Social Media-Based Marathi NER Dataset and BERT Models

The paper introduces a significant advancement in Named Entity Recognition (NER) for low-resource languages, specifically Marathi, through the creation of the L3Cube-MahaSocialNER dataset. This dataset stands out as the largest social media-specific NER dataset for the Marathi language, comprising 18,000 manually labeled sentences. Its development addresses the linguistic and processing challenges posed by the idiosyncrasies of social media data, including non-standard language usage, slang, and informal idioms.

Dataset and Methodology

The L3Cube-MahaSocialNER dataset has been carefully annotated with eight distinct entity classes, employing both IOB and non-IOB tagging schemes. Its annotations are consistent with existing non-social Marathi NER datasets to facilitate cross-domain analysis. This systematic annotation supports the evaluation and training of various deep learning models such as CNN, LSTM, BiLSTM, and several BERT-based transformers, including both monolingual and multilingual variants.
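To make the two notations concrete, below is a minimal, illustrative Python sketch of how a tokenized Marathi sentence might be labeled under each scheme; the tag names are generic placeholders rather than the dataset's exact label inventory.

```python
# Illustrative IOB vs. non-IOB labeling of a tokenized Marathi sentence.
# The tag names (PER, LOC) are generic placeholders, not necessarily the
# exact MahaSocialNER label inventory.

tokens = ["राहुल", "पुण्यात", "राहतो"]  # roughly: "Rahul lives in Pune"

# IOB notation: B- opens an entity span, I- continues it, O marks non-entities.
iob_tags = ["B-PER", "B-LOC", "O"]

# Non-IOB notation: tokens inside an entity carry the bare class label.
non_iob_tags = ["PER", "LOC", "O"]

for token, iob, flat in zip(tokens, iob_tags, non_iob_tags):
    print(f"{token}\t{iob}\t{flat}")
```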

The methodology section explores numerous model architectures, notably highlighting the efficacy of BERT and its variants. A key finding is that fine-tuning existing NER models on domain-specific datasets like MahaSocialNER delivers superior performance: models adapted to the social media domain handled informal and non-standard text markedly better than generic NER models.
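As a rough sketch of what such fine-tuning looks like in practice, the snippet below uses the Hugging Face Transformers token-classification API. The checkpoint id, label count, and hyperparameters are assumptions for illustration, not values taken from the paper; the released checkpoints are listed in the MarathiNLP repository.

```python
# A minimal fine-tuning sketch using Hugging Face Transformers.
# The checkpoint id, label count, and hyperparameters are assumptions;
# see https://github.com/l3cube-pune/MarathiNLP for the released models.

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)


def fine_tune(train_ds, eval_ds,
              model_name="l3cube-pune/marathi-bert-v2",  # assumed MahaBERT id
              num_labels=17):  # 8 entity classes as B-/I- pairs, plus "O"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_labels)

    args = TrainingArguments(
        output_dir="mahasocialner-bert",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )

    # train_ds / eval_ds: tokenized splits with word-aligned label ids,
    # prepared from the MahaSocialNER data (preprocessing omitted here).
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return model, tokenizer
```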

Results and Observations

The experimental results reveal that model performance differs significantly between the social media dataset and traditional non-social datasets. Notably, the MahaNER-BERT model, fine-tuned on the L3Cube-MahaSocialNER dataset, achieved the highest F1 score, indicating its strong suitability for recognizing Marathi entities in social contexts. This underscores the value of transfer learning, in which existing models are fine-tuned on domain-specific datasets to improve applicability and performance.
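For context, NER F1 scores of this kind are conventionally computed at the entity-span level rather than per token. A small sketch with the seqeval library (toy label sequences, not outputs from the paper) shows how a truncated span counts as an error:

```python
# Entity-level F1 as conventionally reported for NER, computed with seqeval.
# The label sequences below are toy examples, not outputs from the paper.

from seqeval.metrics import classification_report, f1_score

gold = [["B-PER", "O", "B-LOC", "I-LOC", "O"]]
pred = [["B-PER", "O", "B-LOC", "O", "O"]]  # LOC span cut short: one error

print(f1_score(gold, pred))  # 0.5, since one of the two entity spans matches
print(classification_report(gold, pred))
```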

Zero-shot experiments conducted using regular NER models on the social media dataset yielded relatively low F1 scores, accentuating the need for specialized datasets in diverse contexts. The highest performances in both the IOB and non-IOB schemes were attributed to models that combined pre-trained knowledge with targeted training on the new dataset, showcasing the robustness and adaptability of BERT-based architectures when fine-tuned properly.
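A zero-shot check of this kind could be reproduced along the following lines. The model id below is an assumption (consult the MarathiNLP repository for the actual released NER checkpoint), and the input is a made-up, code-mixed example of informal Marathi.

```python
# Zero-shot sanity check: apply a regular (non-social) Marathi NER model to
# informal social-media text without any fine-tuning. The model id is an
# assumption; see https://github.com/l3cube-pune/MarathiNLP for checkpoints.

from transformers import pipeline

ner = pipeline("token-classification",
               model="l3cube-pune/marathi-ner",   # assumed MahaNER-BERT id
               aggregation_strategy="simple")     # merge subwords into spans

# Code-mixed, informal Marathi input typical of social media.
print(ner("पुण्यात आज जाम traffic आहे यार"))
```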

Implications and Future Directions

The introduction of the L3Cube-MahaSocialNER dataset is a noteworthy contribution to the field of NER, especially for languages that are not as well-resourced as English. It facilitates advancements in real-time applications such as sentiment analysis, public opinion tracking, and trend analysis on platforms where Marathi is the primary medium of communication. The dataset and models also hold potential for integrating NER capabilities into applications aimed at user-centric information extraction, essential for personalized marketing and news aggregation services.

The research opens avenues for enhancing NER systems for other low-resource languages. By establishing a framework for creating domain-specific datasets and applying state-of-the-art model architectures, the work paves the way for subsequent studies to apply similar methodologies to additional languages and domains. As the NLP field continues to evolve, future work could explore advanced training techniques and leverage cross-lingual resources to strengthen the transfer learning process further. Overall, this research sets a precedent for subsequent efforts in developing and refining language technologies for social media and informal text processing.