
Text classification dataset and analysis for Uzbek language (2302.14494v1)

Published 28 Feb 2023 in cs.CL

Abstract: Text classification is an important task in NLP, where the goal is to categorize text data into predefined classes. In this study, we analyse the dataset creation steps and evaluation techniques of the multi-label news categorisation task as part of text classification. We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites and covers 15 categories of news, press and law texts. We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures, on this newly created dataset. Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models. The best performance is achieved by the BERTbek model, which is a transformer-based BERT model trained on the Uzbek corpus. Our findings provide a good baseline for further research in Uzbek text classification.

Authors (4)
  1. Elmurod Kuriyozov (8 papers)
  2. Ulugbek Salaev (6 papers)
  3. Sanatbek Matlatipov (3 papers)
  4. Gayrat Matlatipov (2 papers)
Citations (7)
