Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks (2401.17396v1)
Abstract: Deep learning-based and, more recently, Transformer-based language models have dominated natural language processing research in recent years. Thanks to their accurate and fast fine-tuning characteristics, they have outperformed traditional machine learning approaches and achieved state-of-the-art results on many challenging natural language understanding (NLU) problems. Recent studies have shown that Transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) achieve impressive results on many tasks. Moreover, thanks to their transfer learning capacity, these architectures allow pre-trained models to be fine-tuned for specific NLU tasks such as question answering. In this study, we provide a Transformer-based model and a baseline benchmark for the Turkish language. We successfully fine-tuned a Turkish BERT model, namely BERTurk trained with the base configuration, on several downstream tasks and evaluated it on a Turkish benchmark dataset. We showed that our fine-tuned models significantly outperform existing baseline approaches for named-entity recognition, sentiment analysis, question answering, and text classification in Turkish. We publicly release these four fine-tuned models and resources for reproducibility and to support other Turkish researchers and applications.
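To make the fine-tuning workflow concrete, below is a minimal sketch of adapting a pre-trained BERTurk checkpoint to one of the four downstream tasks (text classification) with the Hugging Face Trainer API. The checkpoint identifier (`dbmdz/bert-base-turkish-cased`), the CSV file names and columns, the label count, and the hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: fine-tuning a Turkish BERT (BERTurk) encoder for text classification.
# Assumptions: the dbmdz/bert-base-turkish-cased checkpoint is used, and
# train.csv / dev.csv are hypothetical files with "text" and "label" columns.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "dbmdz/bert-base-turkish-cased"  # assumed BERTurk base checkpoint
NUM_LABELS = 6  # e.g., number of news categories; adjust to the target task

dataset = load_dataset(
    "csv", data_files={"train": "train.csv", "validation": "dev.csv"}
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS
)

def tokenize(batch):
    # Pad/truncate to a fixed length; 128 tokens is a common choice for short texts.
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    )

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="berturk-text-classification",
    learning_rate=2e-5,               # typical BERT fine-tuning range (2e-5 to 5e-5)
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
print(trainer.evaluate())  # validation metrics after fine-tuning
```

The other tasks follow the same pattern with a different task head: `AutoModelForTokenClassification` for named-entity recognition and `AutoModelForQuestionAnswering` for extractive question answering, each fine-tuned on the corresponding Turkish dataset.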