- The paper introduces a novel Kazakh NER dataset with 112,702 TV news sentences and 136,333 annotations across 25 entity categories.
- It details the annotation process, carried out by native Kazakh speakers using the IOB2 scheme, and the construction of reproducible training, validation, and test splits.
- Evaluation with state-of-the-art models, notably XLM-RoBERTa, achieved a 97.22% F1-score, showcasing the dataset’s potential for low-resource language NLP.
Overview of the KazNERD: Kazakh Named Entity Recognition Dataset
The paper "KazNERD: Kazakh Named Entity Recognition Dataset" presents the development of a significant resource for named entity recognition (NER) in the Kazakh language. Recognizing the underrepresentation of Kazakh in digital form and the lack of annotated corpora, this work addresses a crucial gap in the field by providing a comprehensive dataset annotated for NER.
Development and Structure of KazNERD
KazNERD was constructed by meticulously annotating 112,702 sentences from television news text. Two native Kazakh speakers, under the supervision of the first author, conducted the annotation process using the IOB2 scheme. This effort resulted in 136,333 annotations spanning 25 distinct named entity categories, which range from CARDINAL and DATE to fine-grained categories like NON_HUMAN and ADAGE. Each annotation was guided by specific protocols tailored for the Kazakh language, which are freely available under the CC BY 4.0 license.
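To make the IOB2 scheme concrete, the sketch below converts entity spans into IOB2 tags: the first token of an entity is tagged `B-<TYPE>`, continuation tokens `I-<TYPE>`, and all other tokens `O`. The sentence and spans are hypothetical illustrations, not examples drawn from KazNERD itself.

```python
def spans_to_iob2(tokens, spans):
    """Convert (start, end, label) token spans to IOB2 tags.

    `spans` uses half-open token indices: (start, end, label).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # entity-initial token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

# Hypothetical sentence with a two-token GPE entity and a DATE:
tokens = ["Нұр-Сұлтан", "қаласы", ",", "2021", "жыл"]
tags = spans_to_iob2(tokens, [(0, 2, "GPE"), (3, 5, "DATE")])
print(tags)  # ['B-GPE', 'I-GPE', 'O', 'B-DATE', 'I-DATE']
```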
To ensure robust applicability, the dataset provides sentences in several representations (AID, BID, CID, DID, EID, FID) so that models can learn to recognise named entities across different typographic forms. The dataset is split into training, validation, and test sets with balanced entity representation, supporting reproducible NER experiments.
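IOB2-annotated NER splits like these are commonly distributed in CoNLL-style files: one token-tag pair per line, with blank lines separating sentences. The parser below is a minimal sketch of that convention; the exact column layout of the KazNERD release is an assumption, not a confirmed detail.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])           # assumed: token in first column
        tags.append(parts[-1])            # assumed: IOB2 tag in last column
    if tokens:                            # flush a trailing sentence
        sentences.append((tokens, tags))
    return sentences

sample = ["Алматы B-GPE", "қаласында O", "", "Бүгін B-DATE", "жаңалық O"]
for toks, iob in read_conll(sample):
    print(toks, iob)
```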
Evaluation with State-of-the-Art Models
The authors trained several NER models on the KazNERD corpus, including a CRF, a BiLSTM-CNN-CRF, multilingual BERT, and XLM-RoBERTa. Notably, XLM-RoBERTa achieved the highest performance, with an F1-score of 97.22% on the test set, underscoring the quality of the dataset for developing effective Kazakh NER models. This result also demonstrates the potential of transfer learning even for a language with Kazakh's agglutinative morphology and distinctive syntax.
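F1-scores reported for NER are typically entity-level: a prediction counts as correct only if both the span and the type match exactly. The pure-Python sketch below illustrates this metric (libraries such as seqeval implement the same idea); it is an illustration of the standard evaluation, not code from the paper.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in an IOB2 sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes last entity
        ends_entity = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if ends_entity and start is not None:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(gold_tags, pred_tags):
    """Micro-averaged entity-level F1 over two parallel tag sequences."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    tp = len(gold & pred)                    # exact span-and-type matches
    if tp == 0 or not gold or not pred:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["B-GPE", "I-GPE", "O", "B-DATE"]
pred = ["B-GPE", "I-GPE", "O", "O"]       # misses the DATE entity
print(round(entity_f1(gold, pred), 2))    # 0.67
```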
Challenges and Future Directions
The paper discusses various challenges encountered during dataset development and annotation. These include handling non-standard capitalisation in Kazakh proper nouns and dealing with complexities of NE coordination and nested entities. Such discussions provide valuable insights into the intricacies of NER for agglutinative languages.
Some categories proved difficult: NON_HUMAN because of its scarcity in the news domain, and ADAGE because of its high variability. Future expansions of KazNERD with data from more diverse genres could mitigate these issues by providing a more balanced representation of such entities. Moreover, investigating more domain-independent NER models could further broaden the dataset's applicability.
Implications and Contributions
KazNERD fills a significant void in publicly available resources for low-resource languages, facilitating research in Kazakh and offering methods potentially transferable to other Turkic languages. The strong results obtained with modern transfer learning models validate the applicability of advanced NER techniques to less-resourced languages and encourage broader AI-driven linguistic research. As digital text processing technologies evolve, resources like KazNERD will be instrumental in advancing multilingual NER systems and could play a pivotal role in NLP advancements in Central Asia and beyond.