
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set (2003.07372v2)

Published 16 Mar 2020 in cs.SI and q-bio.PE

Abstract: At the time of this writing, the novel coronavirus (COVID-19) pandemic outbreak has already put tremendous strain on many countries' citizens, resources and economies around the world. Social distancing measures, travel bans, self-quarantines, and business closures are changing the very fabric of societies worldwide. With people forced out of public spaces, much conversation about these phenomena now occurs online, e.g., on social media platforms like Twitter. In this paper, we describe a multilingual coronavirus (COVID-19) Twitter dataset that we have been continuously collecting since January 22, 2020. We are making our dataset available to the research community (https://github.com/echen102/COVID-19-TweetIDs). It is our hope that our contribution will enable the study of online conversation dynamics in the context of a planetary-scale epidemic outbreak of unprecedented proportions and implications. This dataset could also help track scientific coronavirus misinformation and unverified rumors, or enable the understanding of fear and panic -- and undoubtedly more. Ultimately, this dataset may contribute towards enabling informed solutions and prescribing targeted policy interventions to fight this global crisis.

Analysis of Public Coronavirus Twitter Data Collection

This paper introduces a comprehensive coronavirus (COVID-19) Twitter dataset assembled by the Information Sciences Institute at the University of Southern California. The dataset's coverage begins on January 21, 2020, coinciding with early COVID-19 cases, and captures multilingual Twitter interactions related to the pandemic. Hosted in the COVID-19-TweetIDs GitHub repository, the dataset is positioned as a resource for the research community to explore social media discourse patterns during a global health crisis.

Key Contributions

The primary contribution of this research is an extensive, regularly updated dataset of COVID-19-related tweets, collected through Twitter's Streaming API with Tweepy for keyword and account tracking. As of May 2020, the dataset comprises over 123 million tweets. Most are in English (65.55%), and the remaining 34.45% are spread across 67 other languages. This dataset allows researchers to investigate social dynamics, misinformation spread, and public sentiment during the pandemic.
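The language breakdown reported above can be reproduced in principle by counting a per-tweet language tag. The following is a minimal sketch, assuming each tweet is available as a dictionary with a `lang` field (as in Twitter's tweet objects) and that a missing tag is treated as `und` (undetermined). The function name `language_shares` and the sample records are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def language_shares(tweets):
    """Return {language_code: percent_of_tweets}, most common language first.

    Assumption: every tweet is a dict; a missing 'lang' key is counted
    under 'und' (undetermined).
    """
    counts = Counter(t.get("lang", "und") for t in tweets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        lang: round(100.0 * n / total, 2)
        for lang, n in counts.most_common()
    }


# Toy records for illustration only; these are not real dataset entries.
sample = [{"lang": "en"}, {"lang": "en"}, {"lang": "es"}, {"lang": "ja"}]
shares = language_shares(sample)
print(shares)
```

Applied to the full collection, the same counting would yield an English versus non-English split, although the exact figures depend on which snapshot of the dataset is used.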

Methodological Insights

The data collection strategy combines real-time streaming with historical data querying, with coverage beginning on January 21, 2020. Twitter's public APIs allow systematic tracking of keywords and accounts, and the tracked lists are updated as trends evolve. The approach prioritizes English terms and accounts while still capturing multilingual content, which introduces a bias toward English-speaking regions. Within that constraint, the dataset gives researchers a broad record of how Twitter users worldwide engaged with the pandemic.
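The keyword and account tracking described above can be illustrated with a small local filter. This is only a sketch under stated assumptions: it uses plain case-insensitive substring matching, which is a simplification and not Twitter's server-side matching semantics, and it does not call Tweepy or any Twitter endpoint. The function name `matches_tracking`, the tweet fields `text` and `user`, and the account name are hypothetical.

```python
def matches_tracking(tweet, keywords, accounts):
    """Return True if the tweet is from a tracked account or mentions a keyword.

    Assumptions: 'tweet' is a dict with 'text' and 'user' (screen name);
    matching is case-insensitive substring search, a simplification.
    """
    author = tweet.get("user", "").lower()
    if author in {a.lower() for a in accounts}:
        return True
    text = tweet.get("text", "").lower()
    return any(k.lower() in text for k in keywords)


# Hypothetical tracking lists; the real lists are maintained by the authors.
keywords = ["coronavirus"]
accounts = ["example_health_agency"]

# Toy tweets for illustration only.
stream = [
    {"user": "alice", "text": "New Coronavirus cases reported today"},
    {"user": "Example_Health_Agency", "text": "Wash your hands regularly"},
    {"user": "bob", "text": "Great weather this weekend"},
]
kept = [t for t in stream if matches_tracking(t, keywords, accounts)]
print([t["user"] for t in kept])
```

Because the keyword list in such a scheme is predominantly English, tweets in other languages are kept mainly when they reuse a tracked term or come from a tracked account, which is one way the English bias noted above can arise.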

Observations and Analysis

Initial analysis correlates Twitter discourse metrics with significant events in the pandemic's timeline, validated by media sources such as Business Insider and CNN. Observations indicate spikes in hashtag frequency aligned with pivotal COVID-19 developments, including WHO's declarations and emerging mortality news. Notably, hashtags containing "coronavirus" dominated early discourse, with usage shifts reflecting the pandemic's progression and public response.
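One simple way to surface the kind of hashtag spikes described above is to count matching tweets per day and flag days that rise well above a baseline. The sketch below assumes each tweet is a dictionary with a `date` string and a list of `hashtags`; the median baseline and the `factor` threshold are illustrative choices, not the authors' method, and the sample data is synthetic.

```python
from collections import Counter
from statistics import median


def daily_hashtag_counts(tweets, needle):
    """Count tweets per day having at least one hashtag containing 'needle'."""
    counts = Counter()
    for t in tweets:
        if any(needle in h.lower() for h in t["hashtags"]):
            counts[t["date"]] += 1
    return counts


def spike_days(counts, factor=2.0):
    """Return the days whose count exceeds factor times the median daily count."""
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(d for d, n in counts.items() if n > factor * baseline)


# Synthetic tweets for illustration only; dates and volumes are made up.
sample = (
    [{"date": "2020-03-01", "hashtags": ["Coronavirus"]}] * 2
    + [{"date": "2020-03-02", "hashtags": ["coronavirusoutbreak"]}] * 6
    + [{"date": "2020-03-03", "hashtags": ["coronavirus", "news"]}] * 2
    + [{"date": "2020-03-03", "hashtags": ["weather"]}] * 5
)
counts = daily_hashtag_counts(sample, "coronavirus")
spikes = spike_days(counts)
print(dict(counts), spikes)
```

Flagged days would then be compared by hand against a timeline of reported events, as the summary describes, since a count threshold alone cannot say what caused a spike.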

The dataset also shows language-specific tweet activity that tracks significant events in Italy, Japan, and Spain, illustrating the global reach of the crisis on social media. Additionally, heightened activity from verified accounts around major events underscores Twitter's role as a platform for disseminating authoritative information.

Limitations

The use of Twitter's streaming API limits data capture to roughly 1% of total Twitter volume, so the dataset is a sample rather than a complete record. Furthermore, the predominance of English keywords potentially skews the dataset toward English-language content, though the presence of numerous non-English tweets partially mitigates this bias.

Implications and Future Directions

This dataset is a valuable resource for probing the social implications of the COVID-19 pandemic, offering a basis for studying misinformation, communication patterns, and emotional responses. As the pandemic unfolds, continued data collection will enable robust, longitudinal analyses. The dataset's scope can also be expected to expand as emerging keywords and accounts are added to the tracking lists, which would broaden discourse capture and support deeper insights into global responses to health crises.

Conclusion

The development and dissemination of this multilingual COVID-19 Twitter dataset constitute a significant academic asset. Its granularity gives researchers a useful tool for exploring the intersection between social media and public health, with the potential to advance understanding across numerous disciplines. The ongoing maintenance and growth of the dataset will be critical in meeting the evolving research demands posed by the pandemic and similar future events.

Authors (3)
  1. Emily Chen (16 papers)
  2. Kristina Lerman (197 papers)
  3. Emilio Ferrara (197 papers)
Citations (170)