TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis

Published 29 Nov 2023 in cs.CL, cs.LG, and cs.SI | (2311.18063v1)

Abstract: Turkish is one of the most widely spoken languages in the world. The wide use of the language on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained LLM for Turkish social media, built using almost 900 million tweets. The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk and giving it significantly lower inference time. We trained our model using the same approach as the RoBERTa model and evaluated it on two text classification tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives in generalizability, and that its lower inference time gives it a significant advantage when processing large-scale datasets. We also compared our model with commercial OpenAI solutions in terms of cost and performance to demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of our research, we release TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: https://github.com/ViralLab/TurkishBERTweet


Summary

  • The paper demonstrates that TurkishBERTweet significantly improves sentiment and hate speech detection through a tailored RoBERTa architecture.
  • It leverages a dataset of 894 million tweets and a specialized tokenizer to capture the nuances of Turkish social media language.
  • Its efficiency and adaptability offer a cost-effective, open-source solution for real-time Turkish social media monitoring.

TurkishBERTweet: An LLM for Turkish Social Media Analysis

The paper "TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis" introduces TurkishBERTweet, a specialized LLM developed to address the unique challenges posed by Turkish social media text, particularly tweets. This work is motivated by the significant presence and engagement of Turkish-speaking users on platforms like Twitter, which makes social media analytics an essential tool for both academic and practical applications.

Methodology and Model Architecture

TurkishBERTweet is built on a RoBERTa architecture with a reduced input sequence length of 128 tokens, making it lighter and more computationally efficient than existing models such as BERTurk. It incorporates a specialized tokenizer augmented with domain-specific tokens to better capture the intricacies of social media language, including dedicated tokens for usernames, hashtags, emoticons, URLs, and other identifiers prevalent in tweets.
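Models in this family typically normalize tweets before tokenization so that high-variance elements (usernames, URLs) collapse into shared placeholder tokens. The sketch below illustrates the general idea; the specific placeholders `<user>` and `<http>` and the regex rules are illustrative assumptions, not the exact preprocessing used by TurkishBERTweet.

```python
import re

def normalize_tweet(text: str) -> str:
    """Rough BERTweet-style normalization (placeholder tokens are assumed)."""
    text = re.sub(r"https?://\S+", "<http>", text)  # mask URLs
    text = re.sub(r"@\w+", "<user>", text)          # mask @-mentions
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text

print(normalize_tweet("Harika bir gün! @ali bak: https://example.com"))
# -> "Harika bir gün! <user> bak: <http>"
```

Mapping every mention and URL to a single token keeps the vocabulary focused on actual language content rather than millions of unique handles and links.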

The pre-training dataset comprises approximately 894 million Turkish tweets collected over a decade, ensuring extensive coverage of vocabulary variations and context-specific nuances. This comprehensively compiled dataset is a primary factor in the model's ability to outperform other models in similar contexts.

Evaluation and Results

The authors focus on two key NLP tasks to evaluate TurkishBERTweet: Sentiment Analysis and Hate Speech Detection. Across these tasks, TurkishBERTweet demonstrates superior performance, often surpassing other popular LLMs such as BERTurk, mBERT, and even the widely-used generative models like ChatGPT when fine-tuned for Turkish. Notably, TurkishBERTweet combined with LoRA (Low-Rank Adaptation) shows marked improvements in both sentiment and hate speech tasks, signaling its robust generalizability and inference efficiency.
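LoRA, used by the authors for fine-tuning, freezes the pretrained weights and trains only a low-rank update: the effective weight becomes W + (alpha / r) * B @ A. The plain-Python sketch below shows the arithmetic and the parameter savings; the dimensions and values are arbitrary illustrations, and real implementations (e.g. the peft library the paper relies on) apply this inside the transformer's attention projections rather than to a standalone matrix.

```python
def matmul(X, Y):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r, alpha = 8, 2, 8  # model dim, LoRA rank, scaling (illustrative values)

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1] * r for _ in range(d)]  # d x r, trainable
A = [[0.2] * d for _ in range(r)]  # r x d, trainable

delta = matmul(B, A)               # d x d low-rank update
scale = alpha / r
W_eff = [[W[i][j] + scale * delta[i][j] for j in range(d)] for i in range(d)]

# Only B and A are trained: d*r + r*d parameters instead of d*d.
trainable = len(B) * len(B[0]) + len(A) * len(A[0])
print(trainable, "trainable params vs", d * d, "for full fine-tuning")
```

At realistic dimensions (d in the hundreds, r around 8) the trainable-parameter count drops by orders of magnitude, which is why the released LoRA adapters are small enough to distribute separately from the base model.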

The model's strengths are further highlighted in out-of-distribution evaluations, where its performance remains robust across unfamiliar datasets. This capability is crucial for practical applications where language use continually evolves.

Theoretical and Practical Implications

The release of TurkishBERTweet has substantial implications for Turkish NLP research. It sets a new baseline for modeling social media-specific language, illustrating the benefits of domain-specific model training. Practically, it offers a cost-effective and open-source alternative to proprietary solutions like OpenAI's models, especially meaningful when processing large volumes of tweets for real-time social media monitoring.

Given the adaptations to both lexicon and grammar typical of social media dialects, traditional models often fall short on accuracy and speed. TurkishBERTweet’s efficiency in processing large datasets with lower computational requirements makes it an attractive alternative for real-time applications such as sentiment tracking, social listening, and the identification of emergent topics.
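The inference-time advantage of the shorter input window can be sanity-checked with a back-of-envelope calculation: self-attention compute scales roughly with L² for sequence length L, so a 128-token window costs about 1/16th of a typical 512-token one on the attention term. This sketch assumes a standard transformer cost model, not figures from the paper, and real wall-clock speedups are smaller because feed-forward layers scale only linearly in L.

```python
def attention_flops(seq_len: int, hidden: int = 768) -> int:
    """Attention cost up to a constant: QK^T scores plus weighted values,
    both proportional to seq_len^2 * hidden."""
    return 2 * seq_len * seq_len * hidden

ratio = attention_flops(512) / attention_flops(128)
print(f"attention cost ratio, 512 vs 128 tokens: {ratio:.0f}x")  # 16x
```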

Future Directions

Future work could extend TurkishBERTweet to NLP tasks beyond sentiment analysis and hate speech detection, such as named entity recognition and author profiling. Further advances might enlarge the model's input sequence length or add multilingual capabilities to handle the code-switching that is frequent in social media interactions.

In conclusion, TurkishBERTweet marks a significant advancement in the field of social media NLP, offering both the Turkish academic community and industry stakeholders a reliable and efficient tool for comprehensive text analysis. As part of an evolving field, the methods and findings presented offer a pathway for further exploration and enhancement of LLMs tuned to specific domains.

