COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter (2005.07503v1)

Published 15 May 2020 in cs.CL, cs.LG, and cs.SI

Abstract: In this work, we release COVID-Twitter-BERT (CT-BERT), a transformer-based model, pretrained on a large corpus of Twitter messages on the topic of COVID-19. Our model shows a 10-30% marginal improvement compared to its base model, BERT-Large, on five different classification datasets. The largest improvements are on the target domain. Pretrained transformer models, such as CT-BERT, are trained on a specific target domain and can be used for a wide variety of natural language processing tasks, including classification, question-answering and chatbots. CT-BERT is optimised to be used on COVID-19 content, in particular social media posts from Twitter.

Authors (3)
  1. Martin Müller (25 papers)
  2. Marcel Salathé (27 papers)
  3. Per E Kummervold (3 papers)
Citations (337)

Summary

Analysis of COVID-Twitter-BERT: A Domain-Specific NLP Model for COVID-19-related Social Media Content

The research article introduces COVID-Twitter-BERT (CT-BERT), a transformer-based NLP model tailored specifically for analyzing social media content related to COVID-19. The model is an adaptation of the widely recognized BERT-Large model, customized via domain-specific pretraining on a substantial corpus of COVID-19-related Twitter data. This research provides a comprehensive evaluation of CT-BERT across multiple classification tasks, demonstrating notable improvements in performance, particularly on COVID-19 and health-related datasets.

Methodology

CT-BERT was developed on a dataset of approximately 160 million tweets about COVID-19, collected through the Crowdbreaks platform. The tweets underwent a preprocessing phase involving pseudonymization and removal of duplicates, yielding a training corpus of 22.5 million unique tweets comprising 0.6 billion words. The model was pretrained with the standard masked language modeling (MLM) and next sentence prediction (NSP) objectives over several hundred million training examples on a TPU v3-8 with TensorFlow 2.2.
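
The description above implies a fairly simple cleaning pipeline. The following is a minimal sketch of the pseudonymization and deduplication steps, assuming plain regex-based replacement; the placeholder tokens and patterns are illustrative, not the authors' released preprocessing code.

```python
import re

def pseudonymize(tweet: str) -> str:
    """Replace user handles and URLs with generic placeholder tokens (assumed tokens)."""
    tweet = re.sub(r"@\w+", "@<user>", tweet)        # hide user handles
    tweet = re.sub(r"https?://\S+", "<url>", tweet)  # hide links
    return tweet

def deduplicate(tweets):
    """Keep one copy of each pseudonymized tweet text."""
    seen, unique = set(), []
    for tweet in tweets:
        cleaned = pseudonymize(tweet)
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique

sample = [
    "Wash your hands! https://t.co/abc123 @WHO",
    "Wash your hands! https://t.co/xyz789 @CDCgov",
]
print(deduplicate(sample))  # both tweets collapse to one pseudonymized entry
```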

The evaluation covered five classification datasets, three publicly available and two internally curated, which differ in size, content, and class balance and focus on COVID-19 and vaccine sentiment. CT-BERT was fine-tuned and tested on each dataset, and its performance was compared against the base BERT-Large model.
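
As a concrete illustration of such a fine-tuning setup, here is a hedged sketch using the Hugging Face `transformers` library; the model identifier, hyperparameters, and toy dataset are assumptions for illustration, whereas the paper itself fine-tunes with TensorFlow 2.2 on TPUs.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "digitalepidemiologylab/covid-twitter-bert"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy stand-in for one of the labelled tweet datasets (e.g. vaccine sentiment);
# the real experiments use the five corpora described above.
data = Dataset.from_dict({
    "text": ["Vaccines save lives.", "I will never get vaccinated."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=96)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="ct-bert-finetuned",
                         per_device_train_batch_size=8,
                         num_train_epochs=3,
                         learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=data, eval_dataset=data).train()
```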

Results

Across all datasets, CT-BERT consistently outperformed BERT-Large, raising the average F1-score from 0.802 to 0.833, a mean marginal performance improvement of 17.57%. The largest gains were observed on the COVID-19-specific and vaccination-related datasets, underscoring the model's specialized capability in the health-information domain.
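
One plausible reading of the "marginal" metric (worth checking against the paper's own definition) is the fraction of the remaining headroom to a perfect F1 score that the new model closes:

```latex
\Delta MP = \frac{F1_{\text{CT-BERT}} - F1_{\text{BERT-Large}}}{1 - F1_{\text{BERT-Large}}}
```

With the rounded averages above, (0.833 - 0.802) / (1 - 0.802) ≈ 0.157, so the reported 17.57% is presumably the mean of the per-dataset marginal improvements rather than the marginal improvement of the averaged F1 scores.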

The paper also evaluated intermediate pretraining checkpoints on the downstream tasks, finding that most of the performance gain is achieved within the first 200,000 pretraining steps, after which improvements plateau. This pattern underscores how quickly CT-BERT adapts to domain-specific language when starting from an already strong base model.
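
A hedged sketch of that checkpoint sweep is shown below; the step values, checkpoint layout, and the fine-tuning stub are hypothetical, and the paper ran the actual loop on TPU v3-8 with TensorFlow 2.2.

```python
def finetune_and_eval(checkpoint_path: str) -> float:
    """Stub standing in for a full fine-tuning run on a downstream dev set.

    Replace with real training (e.g. the Trainer sketch above) that returns
    the mean F1 achieved when starting from this pretraining checkpoint.
    """
    return 0.0  # placeholder score

# Hypothetical pretraining checkpoints saved at increasing step counts.
pretraining_steps = [25_000, 50_000, 100_000, 200_000, 400_000, 800_000]

scores = {}
for step in pretraining_steps:
    checkpoint = f"checkpoints/ct-bert-step-{step}"  # hypothetical path layout
    scores[step] = finetune_and_eval(checkpoint)

for step, f1 in sorted(scores.items()):
    print(f"{step:>7} pretraining steps -> downstream F1 = {f1:.3f}")
```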

Discussion

CT-BERT's performance highlights the effectiveness of domain-specific pretraining for targeted application areas such as public health communication. The authors note that while pretraining metrics offer a glimpse into training progress, they do not correlate linearly with downstream task performance, so thorough empirical evaluation on the target tasks remains necessary.

Although the paper focuses on classification tasks, CT-BERT's architecture suggests applicability across a broader spectrum of NLP tasks, from named entity recognition to question answering, within COVID-19 and possibly other health-related discourse.
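
As a small illustration of reuse beyond classification, the pretrained masked-language-model head can be probed directly; the Hugging Face model identifier below is an assumption about where the released weights are hosted, not something stated in the paper.

```python
from transformers import pipeline

# Probe CT-BERT's pretrained MLM head with a COVID-related prompt.
fill = pipeline("fill-mask", model="digitalepidemiologylab/covid-twitter-bert")  # assumed id

for prediction in fill("Wearing a [MASK] helps slow the spread of the virus."):
    print(f"{prediction['token_str']:>12}  p = {prediction['score']:.3f}")
```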

The researchers acknowledge that further optimization of the finetuning hyperparameters, which were initially configured for BERT-Large, could enhance CT-BERT's performance. Additionally, extending the evaluation to more diverse datasets and more recent corpus data would help solidify CT-BERT's position as a valuable tool in digital epidemiology and crisis informatics.

Conclusion

COVID-Twitter-BERT marks a significant advance in applying transformer-based models to specialized domains, reinforcing the importance of tailoring language models to the idiosyncrasies of the target domain. Backed by the empirical evidence presented, the model offers a substantial resource for analyzing pandemic-related social media content, providing timely insight into public sentiment and communication patterns during global health crises. Prospective work involves building on this foundation toward broader AI-driven analyses in public health and beyond.