KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes (2403.19335v2)
Abstract: This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5, providing a quantitative representation of customer attitudes. The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models trained for both polarity classification and score classification. Experimental analysis included evaluation of the results considering both balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets. The dataset and fine-tuned models are open access and available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.
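The gap between the two reported F1-scores is easier to read with the two tasks side by side. The sketch below is illustrative only: the ratings are made-up toy values, not KazSAnDRA data; the rule that ratings of 4 and above count as positive is an assumption, since the abstract does not state how scores map to polarity; and macro-averaging is likewise an assumed choice, since the abstract does not say which F1 variant was reported. The helper names `to_polarity`, `f1_per_class`, and `macro_f1` are hypothetical.

```python
from typing import Dict, List


def to_polarity(score: int, threshold: int = 4) -> int:
    """Map a 1-5 rating to a binary label (1 = positive, 0 = negative).

    The threshold is an assumption made for this sketch, not a rule
    taken from the abstract.
    """
    if not 1 <= score <= 5:
        raise ValueError(f"rating out of range: {score}")
    return 1 if score >= threshold else 0


def f1_per_class(y_true: List[int], y_pred: List[int]) -> Dict[int, float]:
    """Compute F1 = 2TP / (2TP + FP + FN) separately for every label."""
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        result[label] = 2 * tp / denom if denom else 0.0
    return result


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy ratings, deliberately skewed towards high scores (imbalanced).
true_scores = [5, 5, 5, 5, 4, 4, 3, 2, 1, 1]
pred_scores = [5, 5, 5, 4, 5, 4, 4, 1, 1, 2]

# Score classification: five classes, exact rating must match.
score_f1 = macro_f1(true_scores, pred_scores)

# Polarity classification: the same predictions collapsed to two classes.
polarity_f1 = macro_f1(
    [to_polarity(s) for s in true_scores],
    [to_polarity(s) for s in pred_scores],
)

print(f"score macro-F1:    {score_f1:.2f}")
print(f"polarity macro-F1: {polarity_f1:.2f}")
```

On this toy input the same set of predictions scores far higher on the two-class task than on the five-class task, because confusing adjacent ratings (a 4 with a 5, say) is an error for score classification but not for polarity. Under macro-averaging, rarely seen ratings also weigh as much as frequent ones, which is one reason balanced and imbalanced evaluation settings can yield different numbers.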