Creating a morphological and syntactic tagged corpus for the Uzbek language

Published 27 Oct 2022 in cs.CL | (2210.15234v1)

Abstract: Creating tagged corpora has become one of the most important tasks in NLP. For Uzbek, a low-resource language, there are not enough tagged corpora to build machine learning models. In this paper, we try to fill that gap by developing a novel part-of-speech (POS) and syntactic tagset for creating a morphologically and syntactically tagged corpus of the Uzbek language. This work also includes a detailed description and presentation of a web-based application for carrying out the tagging. Based on the developed annotation tool and software, we share our experience and results from the first stage of creating the tagged corpus.