VBART: The Turkish LLM (2403.01308v2)

Published 2 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We present VBART, the first Turkish sequence-to-sequence LLMs pre-trained from scratch on a large corpus. The VBART models are compact LLMs that build on ideas from BART and mBART and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering, and question generation. They can be fine-tuned for future text generation tasks and datasets, carving a new path for Turkish NLP research. Our work shows that a dedicated pre-trained LLM for Turkish outperforms multilingual models up to 3x its size, improving on existing results while providing models that are efficient to train and run. Moreover, we show that our monolingual tokenizer is up to 11x more efficient than multilingual tokenizers. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla Scaling Law to sequence-to-sequence masked LLMs. Our fine-tuned models, tokenizer, and the cleaned 135 GB vngrs-web-corpus are publicly available at huggingface.co/vngrs-ai.
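
The tokenizer-efficiency claim is straightforward to sanity-check. Below is a minimal sketch, assuming the Hugging Face model IDs shown (the VBART checkpoint name is an assumption; the org page huggingface.co/vngrs-ai lists the actual ones), that counts how many tokens VBART's monolingual tokenizer and mBART's multilingual tokenizer spend on the same Turkish sentence:

```python
# Minimal sketch. The vngrs-ai model ID below is assumed; check
# huggingface.co/vngrs-ai for the published VBART checkpoints.
from transformers import AutoTokenizer

text = "Türkiye'nin en büyük şehirlerinden biri İstanbul'dur."

vbart_tok = AutoTokenizer.from_pretrained("vngrs-ai/VBART-Large-Summarization")  # assumed ID
mbart_tok = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")           # multilingual baseline

# Fewer tokens per sentence means shorter sequences, hence cheaper
# training and faster inference on the same text.
print("VBART tokens:", len(vbart_tok(text)["input_ids"]))
print("mBART tokens:", len(mbart_tok(text)["input_ids"]))
```

On their corpus the authors report the monolingual tokenizer is up to 11x more efficient, a gap that compounds directly into training and inference cost since sequence length drives both.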
