A large-scale COVID-19 Twitter chatter dataset for open scientific research -- an international collaboration

Published 7 Apr 2020 in cs.SI and cs.IR | (2004.03688v2)

Abstract: As the COVID-19 pandemic continues its march around the world, an unprecedented amount of open data is being generated for genetics and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated in the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics of such a unique world-wide event into biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 152 million tweets, growing daily, related to COVID-19 chatter generated from January 1st to April 4th at the time of writing. This open dataset will allow researchers to conduct a number of research projects relating to the emotional and mental responses to social distancing measures, the identification of sources of misinformation, and the stratified measurement of sentiment towards the pandemic in near real time.

Summary

The paper introduces a large-scale, continually updated dataset of over 383 million COVID-19-related tweets for open scientific research.
Collected via the Twitter Stream API and adhering to FAIR principles, the dataset includes metadata and preprocessed versions, accessible through provided tools.
This dataset enables researchers to study pandemic-related social dynamics, misinformation, public sentiment, and communication patterns.

Overview of "A Large-scale COVID-19 Twitter Chatter Dataset for Open Scientific Research"

The paper "A Large-scale COVID-19 Twitter Chatter Dataset for Open Scientific Research - An International Collaboration" presents a curated dataset of over 383 million tweets related to the COVID-19 pandemic. The dataset spans January 1st to June 7th, 2020, and is continually expanding, offering a resource for researchers examining the social dynamics of the pandemic in near real time. The initiative grew out of a collaboration among researchers across disciplines, underscoring the importance of data sharing for scientific progress, particularly during global crises such as the COVID-19 pandemic.

Dataset Composition and Methodology

The dataset is built from tweets collected via the publicly available Twitter Stream API, supplemented by data contributed by collaborators. On March 12th, 2020, the collection strategy shifted to focus exclusively on COVID-19-related tweets, selected using specific keywords. The dataset contains metadata including tweet IDs, dates, and times, among other identifiers. The project adheres to the FAIR data principles to keep the data accessible and usable by the scientific community, although Twitter's terms limit the direct sharing of tweet text.
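Because tweet text cannot be shared directly, a common first step with ID-only datasets is to read the identifiers and group them into batches for retrieval ("hydration") through the Twitter API. The following is a minimal sketch of that step, not the authors' tooling. It assumes a tab-separated file with hypothetical column names tweet_id, date, and time, uses invented sample rows, and assumes a batch size of 100, which was the per-request limit of Twitter's v1.1 statuses/lookup endpoint.

```python
import csv
import io
from typing import Iterator, List, Sequence, TextIO

# Invented sample rows. The column names are assumptions made for this
# sketch, not the dataset's documented schema.
SAMPLE_TSV = (
    "tweet_id\tdate\ttime\n"
    "1000000000000000001\t2020-03-12\t00:00:01\n"
    "1000000000000000002\t2020-03-12\t00:00:02\n"
    "1000000000000000003\t2020-03-13\t08:15:40\n"
)


def read_tweet_ids(handle: TextIO, id_column: str = "tweet_id") -> List[str]:
    """Read tweet IDs from a tab-separated file with a header row.

    IDs are kept as strings so that no numeric parsing can alter them.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[id_column] for row in reader]


def batched(ids: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


if __name__ == "__main__":
    ids = read_tweet_ids(io.StringIO(SAMPLE_TSV))
    for batch in batched(ids, size=2):
        # Each batch would be passed to a hydration client of your choice.
        print(batch)
```

Keeping the batching separate from any network call makes the sketch independent of a particular API client or API version.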

The tweets were collected and integrated using Python, together with tools such as Tweepy. As part of the data curation process, the authors also provide a clean version of the dataset with retweets filtered out, intended for researchers with limited computing resources. In addition, they supply lists of frequent terms and n-grams to assist researchers working on linguistic analysis or NLP tasks.
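The two curation steps described here, removing retweets and counting frequent n-grams, can be illustrated with a short sketch. This is not the authors' preprocessing code. It assumes tweet text is already available, that retweets can be recognized by the conventional "RT @" prefix, and that a simple regular-expression tokenizer is adequate; the function names and sample tweets are invented for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# Simplistic tokenizer (an assumption): lowercase runs of letters, digits,
# apostrophes, and the # and @ prefixes used on Twitter.
TOKEN_RE = re.compile(r"[a-z0-9#@']+")


def is_retweet(text: str) -> bool:
    """Heuristic: treat text starting with 'RT @' as a retweet."""
    return text.startswith("RT @")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into tokens."""
    return TOKEN_RE.findall(text.lower())


def ngram_counts(texts: Iterable[str], n: int = 2) -> Counter:
    """Count n-grams (as tuples of tokens) across all texts."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted views of the token list yields each n-gram once.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


if __name__ == "__main__":
    tweets = [
        "Stay home and wash your hands",
        "RT @someone: Stay home and wash your hands",
        "Please wash your hands often",
    ]
    originals = [t for t in tweets if not is_retweet(t)]
    print(ngram_counts(originals, n=2).most_common(3))
```

Filtering before counting keeps a widely retweeted message from dominating the n-gram frequencies, which is one reason a retweet-free version is convenient for linguistic analysis.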

Data Volume and Accessibility

The dataset's monthly tweet volumes track the evolving public discourse. For instance, April and May show significant spikes in volume, with tweet counts exceeding 120 million, indicating heightened public activity during these months. The dataset is also extended with 1.5 million tweets in Russian, broadening its linguistic and geographic coverage. Updates are released every two days, which supports near-real-time research on a continuously evolving pandemic.
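Monthly volumes of this kind can be derived from the date metadata alone. The sketch below is illustrative only: it assumes each tweet's date is available as an ISO-formatted string (YYYY-MM-DD), and the function name and sample dates are invented.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def monthly_volume(dates: Iterable[str]) -> Dict[str, int]:
    """Count tweets per calendar month from ISO date strings.

    Returns a dict keyed by 'YYYY-MM', sorted chronologically.
    """
    counts: Counter = Counter()
    for raw in dates:
        parsed = date.fromisoformat(raw)  # raises ValueError on bad input
        counts[f"{parsed.year:04d}-{parsed.month:02d}"] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample_dates = ["2020-04-01", "2020-04-15", "2020-05-02"]
    print(monthly_volume(sample_dates))
```

On the full dataset the same counting would run over hundreds of millions of rows, so streaming the dates from the file, rather than loading them into a list first, is likely the more practical choice.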

Research Implications and Utility

The potential applications of this dataset are extensive. Researchers can explore multifaceted aspects of the pandemic, including misinformation dissemination, public sentiments regarding health measures, and emotional responses to various stages of the pandemic. Such analyses could yield insights into societal behaviors and communication patterns during crises, informing both public health strategies and communication policies.

By making the preprocessing tools publicly available in a repository, the authors widen access to the dataset and enable efficient parsing, cleaning, and processing in diverse research contexts. This tooling is particularly useful for scaling analysis to the large data volumes characteristic of social media platforms.

Future Directions

As the pandemic continues, the ongoing expansion and updating of this dataset ensure its relevance. Future efforts may include the incorporation of more diverse linguistic datasets to further enhance the scope of cross-cultural comparisons. The sustained data collection and sharing framework established here could serve as a model for other large-scale social data initiatives, highlighting the efficacy of collaborative and open-data practices in advancing collective scientific understanding and response readiness.

This dataset stands as a testament to the confluence of multi-institutional cooperation, technological integration, and methodological rigor in response to an unprecedented global event. As such, it holds significant promise for ongoing and future research endeavors.