L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models (2204.06029v1)

Published 12 Apr 2022 in cs.CL and cs.LG

Abstract: Named Entity Recognition (NER) is a basic NLP task and finds major applications in conversational and search systems. It helps us identify key entities in a sentence used for the downstream application. NER or similar slot filling systems for popular languages have been heavily used in commercial applications. In this work, we focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state. Marathi is a low resource language and still lacks useful NER resources. We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi. We also describe the manual annotation guidelines followed during the process. In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc. The MahaBERT provides the best performance among all the models. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .

L3Cube-MahaNER: Advancing Marathi Named Entity Recognition

The paper presents L3Cube-MahaNER, a Marathi Named Entity Recognition (NER) dataset and accompanying BERT models. Its aim is to establish the first major gold-standard NER dataset for Marathi, a low-resource Indian language, addressing a notable gap in the resources available for it. The dataset introduces an annotation scheme tailored to Marathi NER and extends existing resources significantly in both volume and diversity of entity categories.

Dataset Development and Annotation

The dataset, L3Cube-MahaNER, consists of 25,000 manually annotated sentences covering eight named entity classes: Person (NEP), Location (NEL), Organization (NEO), Measure (NEM), Time (NETI), Date (NED), Designation (ED), and a catch-all Other (O) tag. The sentences, drawn from news-domain corpora, were annotated under strict guidelines to ensure consistency across the data. Each sentence is provided in both IOB and non-IOB formats, which broadens the dataset's utility across NER architectures and methodologies.
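
To make the two label formats concrete, here is a minimal sketch contrasting them; the romanized Marathi sentence and its tags are hypothetical illustrations, not rows from the corpus.

```python
# Contrast the dataset's two label formats on a toy sentence.
# Sentence: "Sachin Tendulkar Mumbai madhe rahato"
# ("Sachin Tendulkar lives in Mumbai") -- a hypothetical example.
tokens = ["Sachin", "Tendulkar", "Mumbai", "madhe", "rahato"]

# Non-IOB: every token of an entity carries the bare class tag.
non_iob_tags = ["NEP", "NEP", "NEL", "O", "O"]

# IOB: B- marks the first token of an entity span, I- the rest,
# so decoders can recover boundaries between adjacent entities
# of the same class.
iob_tags = ["B-NEP", "I-NEP", "B-NEL", "O", "O"]

for tok, plain, iob in zip(tokens, non_iob_tags, iob_tags):
    print(f"{tok:12s}{plain:8s}{iob}")
```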

Experimental Framework and Results

A range of deep learning architectures was employed to benchmark the L3Cube-MahaNER dataset:

  • Traditional models: Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and bidirectional LSTM (biLSTM) were trained, with biLSTM performing best among the three.
  • Transformer-based models: multilingual architectures such as mBERT, IndicBERT, and XLM-RoBERTa were compared against Marathi-specific models such as MahaBERT, MahaRoBERTa, and MahaALBERT. MahaBERT achieved the best results with non-IOB labels, while MahaRoBERTa led with IOB labels, underscoring the potential of language-specific transformers in low-resource settings (see the fine-tuning sketch after this list).
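
As a concrete starting point, the following minimal sketch sets up a MahaBERT-style token-classification baseline with the Hugging Face transformers library. The checkpoint name l3cube-pune/marathi-bert-v2 and the non-IOB label inventory are assumptions based on the paper's public repository, not the authors' exact training script.

```python
# Sketch: fine-tuning setup for Marathi NER with transformers.
# Checkpoint and label set are assumptions, not the paper's script.
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL = "l3cube-pune/marathi-bert-v2"  # assumed MahaBERT checkpoint

# Non-IOB label inventory sketched from the paper's tag set.
labels = ["O", "NEP", "NEL", "NEO", "NEM", "NETI", "NED", "ED"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL,
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

def encode(words, tags):
    """Align word-level tags to subword tokens: only the first
    subword of each word keeps its label; continuations get -100
    so the cross-entropy loss ignores them."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    return_tensors="pt")
    aligned, prev = [], None
    for wid in enc.word_ids(0):
        aligned.append(-100 if wid is None or wid == prev
                       else label2id[tags[wid]])
        prev = wid
    return enc, aligned
```

From here, the encoded examples can be fed into a standard training loop; the label-alignment step is the only NER-specific detail.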

The results, reported in detail in the paper, show a considerable improvement over previous models used for Marathi NER, setting a new benchmark and laying the groundwork for further computational processing of the Marathi language.
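
For context, NER benchmarks of this kind are typically scored with entity-level F1 rather than per-token accuracy. Below is a minimal sketch using the seqeval library, with hypothetical tag sequences rather than results from the paper.

```python
# Entity-level scoring for IOB-tagged NER with seqeval.
# The tag sequences are hypothetical, not the paper's outputs.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-NEP", "I-NEP", "O", "B-NEL"]]
y_pred = [["B-NEP", "I-NEP", "O", "O"]]  # misses the location span

print(f1_score(y_true, y_pred))           # span-level F1
print(classification_report(y_true, y_pred))
```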

Implications and Future Directions

The introduction of L3Cube-MahaNER fills a critical gap in the resources available for Marathi and advances NER for the language. The dataset's size and quality enable more reliable model training and evaluation, and open the door to exploring architectural improvements.

The paper's emphasis on rigorous annotation guidelines and its open release of data and models reflect a commitment to reproducibility and community engagement. These efforts are likely to encourage further research in Marathi NLP and to improve the accuracy and applicability of Marathi language models in real-world scenarios such as digital assistants, search engines, and automated customer service systems.

Future work could expand the dataset's annotations toward contextually rich resources such as knowledge graphs, or integrate it with cross-lingual transfer learning to further improve performance in resource-constrained settings. More nuanced use of attention-based transformer architectures could also be explored to push the boundaries of what is achievable in Marathi NER.

In conclusion, L3Cube-MahaNER is a significant contribution to Named Entity Recognition for the Marathi language, offering a robust foundation for subsequent research and application development. The open availability of both the dataset and the models further enriches the potential for collaborative innovation in Indic language processing.

Authors (5)
  1. Parth Patil (5 papers)
  2. Aparna Ranade (3 papers)
  3. Maithili Sabane (4 papers)
  4. Onkar Litake (11 papers)
  5. Raviraj Joshi (76 papers)
Citations (18)