TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis
Abstract: Turkish is one of the most widely spoken languages in the world. The wide use of this language on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained LLM for Turkish social media, built on almost 900 million tweets. The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk and giving it significantly lower inference time. We trained our model following the same approach used for the RoBERTa model and evaluated it on two text classification tasks: sentiment classification and hate speech detection. We demonstrate that TurkishBERTweet outperforms the available alternatives in generalizability, and that its lower inference time gives it a significant advantage when processing large-scale datasets. We also compare our models with commercial OpenAI solutions in terms of cost and performance to demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of our research, we release TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: https://github.com/ViralLab/TurkishBERTweet
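Since the abstract describes released LoRA adapters on top of the base model, a minimal usage sketch may help readers. The snippet below assumes the model and a sentiment adapter are published on the Hugging Face Hub; the hub identifiers, the label count, and the example tweet are illustrative assumptions, not confirmed by the paper — consult the GitHub repository for the exact names and the recommended tweet preprocessing.

```python
# Minimal sketch: loading TurkishBERTweet with a fine-tuned LoRA adapter.
# NOTE: the hub ids and num_labels below are assumptions for illustration;
# check https://github.com/ViralLab/TurkishBERTweet for the exact identifiers
# and any required tweet normalization (user mentions, URLs, emojis, etc.).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

BASE_ID = "VRLLab/TurkishBERTweet"             # hypothetical base-model hub id
ADAPTER_ID = "VRLLab/TurkishBERTweet-Lora-SA"  # hypothetical sentiment LoRA adapter id

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_ID, num_labels=3  # assumed negative/neutral/positive labels
)

# Attach the task-specific LoRA adapter on top of the frozen base model.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()

inputs = tokenizer("bugün hava çok güzel :)", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities; label order per adapter config
```

Because LoRA stores only low-rank weight updates, each task ships as a small adapter that is merged onto the shared base model at load time, which is what makes releasing one adapter per task practical.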