
VNLP: Turkish NLP Package (2403.01309v1)

Published 2 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In this work, we present VNLP: the first dedicated, complete, open-source, well-documented, lightweight, production-ready, state-of-the-art NLP package for the Turkish language. It contains a wide variety of tools, ranging from the simplest tasks, such as sentence splitting and text normalization, to the more advanced ones, such as text and token classification models. Its token classification models are based on "Context Model", a novel architecture that is both an encoder and an auto-regressive model. NLP tasks solved by VNLP models include but are not limited to Sentiment Analysis, Named Entity Recognition, Morphological Analysis & Disambiguation and Part-of-Speech Tagging. Moreover, it comes with pre-trained word embeddings and corresponding SentencePiece Unigram tokenizers. VNLP has an open-source GitHub repository, Read the Docs documentation, a PyPI package for convenient installation, Python and command-line APIs, and a demo page to test all the functionality. Consequently, our main contribution is a complete, compact, easy-to-install and easy-to-use NLP package for Turkish.
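To make concrete why text normalization for Turkish benefits from dedicated tooling, here is a minimal sketch of Turkish-aware case folding. It is not VNLP's implementation and does not use the VNLP API; the function names `turkish_lower` and `turkish_upper` are hypothetical and chosen only for illustration. The sketch relies on one well-known fact: Turkish has a dotted pair (İ/i) and a dotless pair (I/ı), which generic case conversion does not respect.

```python
# Illustrative only: NOT taken from VNLP. Function names are hypothetical.

def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules (İ -> i, I -> ı)."""
    # Handle the two Turkish-specific capitals first, then let the
    # generic lowercase conversion take care of every other letter.
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using Turkish casing rules (i -> İ, ı -> I)."""
    return text.replace("i", "İ").replace("ı", "I").upper()


if __name__ == "__main__":
    for word in ("İSTANBUL", "ISPARTA"):
        # Compare the generic conversion with the Turkish-aware one.
        print(repr(word.lower()), repr(turkish_lower(word)))
```

Generic `str.lower()` turns "ISPARTA" into "isparta", whereas Turkish orthography expects "ısparta"; the sketch handles the two special letter pairs before delegating the rest to the standard conversion.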
