TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis
Abstract: Turkish is one of the most widely spoken languages in the world. The wide use of this language on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained LLM for Turkish social media, built on almost 900 million tweets. The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk and giving it significantly lower inference time. We trained our model following the same approach used for the RoBERTa model and evaluated it on two text classification tasks: sentiment classification and hate speech detection. We demonstrate that TurkishBERTweet outperforms the available alternatives in generalizability, and that its lower inference time gives it a significant advantage when processing large-scale datasets. We also compare our models with commercial OpenAI solutions in terms of cost and performance to demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of our research, we release TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: https://github.com/ViralLab/TurkishBERTweet
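Since the abstract describes released LoRA adapters on top of the base model, a minimal usage sketch may help readers. The snippet below assumes the model and a sentiment adapter are published on the Hugging Face Hub; the hub identifiers, the label count, and the example tweet are illustrative assumptions, not confirmed by the paper — consult the GitHub repository for the exact names and the recommended tweet preprocessing.

```python
# Minimal sketch: loading TurkishBERTweet with a fine-tuned LoRA adapter.
# NOTE: the hub ids and num_labels below are assumptions for illustration;
# check https://github.com/ViralLab/TurkishBERTweet for the exact identifiers
# and any required tweet normalization (user mentions, URLs, emojis, etc.).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

BASE_ID = "VRLLab/TurkishBERTweet"             # hypothetical base-model hub id
ADAPTER_ID = "VRLLab/TurkishBERTweet-Lora-SA"  # hypothetical sentiment LoRA adapter id

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_ID, num_labels=3  # assumed negative/neutral/positive labels
)

# Attach the task-specific LoRA adapter on top of the frozen base model.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()

inputs = tokenizer("bugün hava çok güzel :)", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities; label order per adapter config
```

Because LoRA stores only low-rank weight updates, each task ships as a small adapter that is merged onto the shared base model at load time, which is what makes releasing one adapter per task practical.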