Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set

Published 16 Mar 2020 in cs.SI and q-bio.PE | (2003.07372v2)

Abstract: At the time of this writing, the novel coronavirus (COVID-19) pandemic outbreak has already put tremendous strain on many countries' citizens, resources and economies around the world. Social distancing measures, travel bans, self-quarantines, and business closures are changing the very fabric of societies worldwide. With people forced out of public spaces, much conversation about these phenomena now occurs online, e.g., on social media platforms like Twitter. In this paper, we describe a multilingual coronavirus (COVID-19) Twitter dataset that we have been continuously collecting since January 22, 2020. We are making our dataset available to the research community (https://github.com/echen102/COVID-19-TweetIDs). It is our hope that our contribution will enable the study of online conversation dynamics in the context of a planetary-scale epidemic outbreak of unprecedented proportions and implications. This dataset could also help track scientific coronavirus misinformation and unverified rumors, or enable the understanding of fear and panic -- and undoubtedly more. Ultimately, this dataset may contribute towards enabling informed solutions and prescribing targeted policy interventions to fight this global crisis.

Citations (170)

Summary

The paper establishes a comprehensive COVID-19 Twitter dataset with over 123 million tweets collected since January 2020.
It employs Twitter's Streaming API and Tweepy for real-time and historical data gathering, facilitating analysis of global discourse and misinformation.
Findings reveal language-specific engagement patterns and hashtag spikes aligned with key pandemic events, highlighting dynamic public responses.

Analysis of Public Coronavirus Twitter Data Collection

This paper introduces a coronavirus (COVID-19) Twitter dataset assembled by the Information Sciences Institute at the University of Southern California. The dataset begins on January 21, 2020, coinciding with early COVID-19 cases, and captures multilingual Twitter activity related to the pandemic. Hosted in the COVID-19-TweetIDs GitHub repository, it is offered to the research community as a resource for studying social media discourse during a global health crisis.

Key Contributions

The primary contribution of this research is an extensive, regularly updated dataset of COVID-19-related tweets, collected through Twitter's Streaming API with Tweepy for keyword and account tracking. As of May 2020, the dataset comprises over 123 million tweets. Most are in English (65.55%), and the remaining 34.45% are spread across 67 other languages. The dataset allows researchers to investigate social dynamics, the spread of misinformation, and public sentiment during the pandemic.
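The language breakdown above is the kind of figure a reader can recompute after rehydrating the tweet IDs. The following is a minimal sketch, assuming each rehydrated tweet is a dictionary with a `lang` field; the function name `language_shares` and the toy records are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def language_shares(tweets):
    """Return {language_code: percentage of tweets}, largest share first.

    Assumes each tweet is a dict; tweets without a "lang" key are
    counted under "und" (undetermined).
    """
    counts = Counter(t.get("lang", "und") for t in tweets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}


# Hypothetical toy records, not drawn from the dataset.
sample = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "en"},
    {"id": 3, "lang": "es"},
    {"id": 4, "lang": "it"},
]

shares = language_shares(sample)
non_english = 100.0 - shares.get("en", 0.0)
```

On the full dataset, the same computation would be expected to reproduce the English and non-English split reported above, provided the language labels come from the same source as the paper's.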

Methodological Insights

The data collection strategy combines real-time streaming with historical data querying, with coverage beginning on January 21, 2020. Twitter's public APIs allow systematic tracking of keywords and accounts, and the tracked lists are updated as trends evolve. The approach prioritizes English terms and accounts. It still captures multilingual content, but it introduces a bias towards English-speaking regions. Within that constraint, the resulting collection reflects global engagement with the pandemic on social media.
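The keyword and account tracking described above can be illustrated offline. The sketch below is a simplified local stand-in, not the authors' collection code and not Twitter's actual matching rules: it keeps a tweet if its text contains a tracked keyword (case-insensitive substring match) or if it was posted by a tracked account. The field names `text` and `user_id`, the keyword list, and the account ID are all hypothetical.

```python
def matches_filter(tweet, keywords, account_ids):
    """Return True if the tweet mentions a tracked keyword or comes
    from a tracked account.

    Simplification: keywords are matched as case-insensitive
    substrings of the tweet text.
    """
    text = tweet.get("text", "").lower()
    if any(keyword.lower() in text for keyword in keywords):
        return True
    return tweet.get("user_id") in account_ids


# Hypothetical tracking lists and tweet records for illustration only.
keywords = ["coronavirus", "covid-19"]
account_ids = {101}

stream = [
    {"id": 1, "user_id": 7, "text": "New Coronavirus case reported"},
    {"id": 2, "user_id": 101, "text": "Press briefing at 5pm"},
    {"id": 3, "user_id": 8, "text": "Nice weather today"},
]

kept = [t["id"] for t in stream if matches_filter(t, keywords, account_ids)]
```

Because the keyword list here is English, a tweet about the same topic in another language would pass only if it happened to contain one of these terms, which is one way the English bias noted above can arise.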

Observations and Analysis

Initial analysis correlates Twitter discourse metrics with significant events in the pandemic's timeline, validated by media sources such as Business Insider and CNN. Observations indicate spikes in hashtag frequency aligned with pivotal COVID-19 developments, including WHO's declarations and emerging mortality news. Notably, hashtags containing "coronavirus" dominated early discourse, with usage shifts reflecting the pandemic's progression and public response.
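One simple way to look for the hashtag spikes described above is to compare each day's count with a trailing average. The following is a minimal sketch under that assumption; the paper does not state that it uses this rule, and the function name `find_spikes`, the `window` and `factor` parameters, and the daily counts are hypothetical.

```python
from statistics import mean


def find_spikes(daily_counts, window=3, factor=2.0):
    """Return the days whose count exceeds `factor` times the mean of
    the preceding `window` days.

    `daily_counts` maps a day label to a count and must already be in
    chronological order. The first `window` days are never flagged
    because they have no full baseline.
    """
    days = list(daily_counts)
    spikes = []
    for i in range(window, len(days)):
        baseline = mean(daily_counts[d] for d in days[i - window:i])
        if daily_counts[days[i]] > factor * baseline:
            spikes.append(days[i])
    return spikes


# Hypothetical daily counts for one hashtag, not real measurements.
daily = {
    "day-1": 10,
    "day-2": 12,
    "day-3": 11,
    "day-4": 40,
    "day-5": 30,
}

spikes = find_spikes(daily)
```

In this toy series only `day-4` is flagged: its count of 40 is more than twice the trailing mean of 11, while `day-5` is measured against a baseline that already includes the spike. Flagged days could then be checked against a timeline of pandemic events, as the paper does with media reports.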

The dataset also shows language-specific tweet activity that tracks significant events in Italy, Japan, and Spain, underscoring the global reach of the crisis on social media. In addition, heightened activity from verified accounts during major events points to Twitter's role as a platform for disseminating authoritative information.

Limitations

Using Twitter's Streaming API limits data capture to roughly 1% of total Twitter volume, which constrains the dataset's comprehensiveness. In addition, the predominance of English keywords may skew the dataset, although the presence of numerous non-English tweets partially mitigates this issue.

Implications and Future Directions

This dataset is a valuable resource for examining the social implications of the COVID-19 pandemic, supporting the study of misinformation, communication patterns, and emotional responses. As the pandemic unfolds, continued data collection will enable longitudinal analyses. The dataset's scope is also expected to expand as emerging keywords and accounts are added to the tracking lists, broadening coverage of the discourse and supporting deeper insight into global responses to health crises.

Conclusion

The development and public release of this multilingual COVID-19 Twitter dataset is a significant contribution to the research community. Its scale and language coverage give researchers a tool for exploring the intersection of social media and public health, with the potential to advance understanding across many disciplines. Ongoing maintenance and growth of the dataset will be critical to meeting the research demands raised by the pandemic and by similar future events.
