BERTweet: A pre-trained language model for English Tweets (2005.10200v2)

Published 20 May 2020 in cs.CL and cs.LG

Abstract: We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at https://github.com/VinAIResearch/BERTweet

Authors (3)
  1. Dat Quoc Nguyen (55 papers)
  2. Thanh Vu (59 papers)
  3. Anh Tuan Nguyen (17 papers)
Citations (833)

Summary

BERTweet: A Pre-trained Language Model for English Tweets

The paper "BERTweet: A pre-trained language model for English Tweets" by Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen introduces BERTweet, a large-scale language model tailored specifically for English tweets. Built on the BERT-base architecture and trained with the RoBERTa pre-training procedure, BERTweet addresses a gap left by existing language models, which are predominantly trained on more conventional forms of written text such as Wikipedia articles or news.

Introduction

Building upon the architecture of BERT and its derivatives, the authors developed BERTweet to cater to the distinctive linguistic characteristics of Twitter data, including abbreviated forms, informal grammar, and frequent typographical errors. Twitter's importance as a source of real-time information across many domains motivated the creation of this specialized model. Prior to this work, the literature lacked a language model pre-trained on a tweet corpus of comparable scale: an 80GB corpus containing 850 million English tweets.

Architecture and Data

BERTweet retains the BERT-base configuration and follows the RoBERTa pre-training procedure. The pre-training dataset comprises tweets collected from a general Twitter stream (2012 to 2019) together with tweets related to the COVID-19 pandemic (early 2020). The corpus spans 16 billion word tokens, with each tweet containing between 10 and 64 tokens, yielding a rich and diverse training set that captures a wide range of tweet styles and topics. Tokenization and pre-processing were designed to handle Twitter-specific linguistic idiosyncrasies, including the normalization of user mentions and URLs.
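As a concrete illustration, the following minimal Python sketch shows the kind of normalization described here, mapping user mentions and URLs to placeholder tokens. The placeholder names @USER and HTTPURL follow the released pre-processing, but the regular expressions themselves are simplified assumptions for this example; emoji handling and word-level tokenization are omitted.

    import re

    def normalize_tweet(text: str) -> str:
        """Minimal sketch of tweet normalization: mentions and URLs become placeholders."""
        text = re.sub(r"@\w+", "@USER", text)                      # user mentions
        text = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", text)   # URLs
        return " ".join(text.split())                              # collapse whitespace

    print(normalize_tweet("Loving the new update from @example! https://t.co/abc123"))
    # -> "Loving the new update from @USER! HTTPURL"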

Optimization was performed using the RoBERTa implementation in the fairseq library with a masked language modeling objective, over a period of approximately four weeks. Training involved tuning key hyperparameters, particularly the maximum sequence length, batch size, and learning rate, with the setup optimized for NVIDIA V100 GPUs.
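To make the objective concrete, here is a small, self-contained sketch of RoBERTa-style dynamic masking in PyTorch (roughly 15% of tokens selected, with the standard 80/10/10 split between the mask token, a random token, and the original token). It reflects the general recipe rather than the authors' exact fairseq configuration.

    import torch

    def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
        """RoBERTa-style dynamic masking for a batch of token ids (illustrative sketch)."""
        input_ids = input_ids.clone()
        labels = input_ids.clone()
        # Select ~15% of positions as prediction targets.
        selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
        labels[~selected] = -100  # unselected positions are ignored by the loss
        # 80% of selected positions are replaced with the mask token.
        masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
        input_ids[masked] = mask_token_id
        # 10% are replaced by a random token; the remaining 10% stay unchanged.
        randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
        input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
        return input_ids, labels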

Experimental Setup and Results

The paper evaluates BERTweet on three downstream Tweet NLP tasks: Part-of-speech (POS) tagging, Named-entity recognition (NER), and text classification.

  1. POS Tagging: Datasets include Ritter11-T-POS, ARK-Twitter, and Tweebank-v2. Experiments demonstrate that BERTweet either matches or surpasses the performance of existing models, including RoBERTa-base and XLM-R-base, indicating its robustness in understanding tweet-specific syntax.
  2. NER: Evaluations were conducted on the WNUT16 and WNUT17 datasets, with BERTweet outperforming previous models by significant margins, especially on the WNUT17 dataset, where it surpassed the state-of-the-art by more than 14%.
  3. Text Classification: Using the SemEval2017 Task 4A dataset for sentiment analysis and SemEval2018 Task 3A for irony detection, BERTweet achieved stronger results than both the base and large variants of RoBERTa and XLM-R (a minimal fine-tuning sketch follows this list).
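As referenced above, the sketch below illustrates fine-tuning for the text classification setting. It assumes the released checkpoint is available through the Hugging Face hub under the identifier vinai/bertweet-base and that the tokenizer's normalization flag applies the tweet pre-processing described earlier; the toy batch, label scheme, and hyperparameters are illustrative, not the paper's exact setup.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Assumed hub identifier for the released checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
    model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3)

    # Toy batch: one tweet, one sentiment label (0=negative, 1=neutral, 2=positive).
    batch = tokenizer(["SOOO happy about the result!! @someuser https://t.co/xyz"],
                      padding=True, truncation=True, max_length=64, return_tensors="pt")
    labels = torch.tensor([2])

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    outputs = model(**batch, labels=labels)   # returns loss and logits
    outputs.loss.backward()
    optimizer.step()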

Across all tasks, BERTweet's performance consistently outpaced traditional models pre-trained on more conventional data corpora. This is a testament to the significance of domain-specific pre-training in enhancing NLP task performance.

Discussion and Implications

The experimental results highlight the efficacy of domain-specific language models. Despite being pre-trained on less data than models like RoBERTa and XLM-R, BERTweet's tailored pre-training on tweets enabled it to achieve superior performance on tasks involving informal and irregular text. This reinforces the notion that domain-specific linguistic phenomena are best captured by models pre-trained on domain-specific corpora.

Future directions for this line of research could include scaling BERTweet up to a "large" configuration, which might further enhance performance. Additionally, extending pre-training to multilingual tweet corpora could produce models capable of handling the diversity of linguistic styles used across different languages on Twitter.

Conclusion

BERTweet stands out as an impactful contribution to the deployment of NLP technology on social media text. By leveraging a substantial corpus of tweets and a proven pre-training method, BERTweet not only sets new performance benchmarks but also opens the door for further innovations in tweet-specific language modeling. This work lays a solid foundation for the development of even more sophisticated models tailored to the ever-evolving and diverse landscape of social media communication.
