L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models (2401.00170v1)
Abstract: This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, and it addresses the challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the dataset on its own, using both IOB and non-IOB notations. The results demonstrate that these models recognize named entities accurately in informal Marathi text. The L3Cube-MahaSocialNER dataset enables user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set, highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP
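The abstract mentions IOB and non-IOB notations without showing how they differ. The following is a minimal sketch of the distinction, not the authors' preprocessing code: in IOB each entity tag carries a `B-` (beginning) or `I-` (inside) prefix, while non-IOB keeps only the entity class. The labels `PER` and `LOC` and both function names are assumptions chosen for illustration; the dataset's actual eight class names are not given in the abstract.

```python
def iob_to_non_iob(tags):
    """Drop the B-/I- position prefix from each tag; 'O' stays unchanged."""
    return [t.split("-", 1)[1] if t[:2] in ("B-", "I-") else t for t in tags]


def non_iob_to_iob(tags):
    """Rebuild IOB tags from class-only tags.

    A tag that repeats the previous class is treated as a continuation (I-),
    so two adjacent entities of the same class are merged into one span.
    """
    out = []
    prev = "O"
    for t in tags:
        if t == "O":
            out.append("O")
        elif t == prev:
            out.append("I-" + t)
        else:
            out.append("B-" + t)
        prev = t
    return out


# Hypothetical four-token sentence: a two-token person name, a filler token, a location.
iob_tags = ["B-PER", "I-PER", "O", "B-LOC"]
flat_tags = iob_to_non_iob(iob_tags)
print(flat_tags)
print(non_iob_to_iob(flat_tags))
```

The round trip is exact only when no two entities of the same class are adjacent, since the non-IOB form does not record where one entity ends and the next begins.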
- Harsh Chaudhari
- Anuja Patil
- Dhanashree Lavekar
- Pranav Khairnar
- Raviraj Joshi