Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks (2401.17396v1)

Published 30 Jan 2024 in cs.CL and cs.AI

Abstract: Deep learning-based and, more recently, Transformer-based language models have dominated natural language processing research in recent years. Thanks to their accurate and fast fine-tuning characteristics, they have outperformed traditional machine learning-based approaches and achieved state-of-the-art results on many challenging natural language understanding (NLU) problems. Recent studies have shown that Transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) achieve impressive results on many tasks. Moreover, thanks to their transfer learning capacity, these architectures allow pre-trained models to be transferred and fine-tuned for specific NLU tasks such as question answering. In this study, we provide a Transformer-based model and a baseline benchmark for the Turkish language. We successfully fine-tuned a Turkish BERT model, namely BERTurk trained with base settings, on several downstream tasks and evaluated it on a Turkish benchmark dataset. We show that our models significantly outperform existing baseline approaches for Named-Entity Recognition, Sentiment Analysis, Question Answering, and Text Classification in the Turkish language. We publicly release these four fine-tuned models and associated resources for reproducibility and to support other Turkish researchers and applications.
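
The fine-tuning recipe described in the abstract maps directly onto the standard Hugging Face Transformers workflow. The sketch below is illustrative, not the paper's exact pipeline: it fine-tunes the publicly released BERTurk base checkpoint ("dbmdz/bert-base-turkish-cased") on a toy binary sentiment dataset, and the label count, sequence length, and hyperparameters are assumptions rather than the paper's reported settings.

```python
# Minimal sketch: fine-tuning BERTurk for a Turkish text-classification task.
# Toy data and hyperparameters are illustrative assumptions, not the paper's setup.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "dbmdz/bert-base-turkish-cased"  # publicly released BERTurk base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy binary sentiment examples; a real run would use a benchmark corpus instead.
train_ds = Dataset.from_dict({
    "text": ["Bu film harikaydı.", "Hizmet çok kötüydü."],
    "label": [1, 0],
})

def tokenize(batch):
    # Pad/truncate to a fixed length; 128 tokens is an assumed setting.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="berturk-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # a common BERT fine-tuning rate, not necessarily the paper's value
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```

The same pattern carries over to the other tasks mentioned above by swapping the task head, e.g. AutoModelForTokenClassification for named-entity recognition or AutoModelForQuestionAnswering for question answering.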

Authors (1)
  1. Savas Yildirim (6 papers)
Citations (4)