Analysis of Public Coronavirus Twitter Data Collection
This paper introduces a large-scale coronavirus (COVID-19) Twitter dataset assembled by the Information Sciences Institute at the University of Southern California. Collection began on January 21, 2020, coinciding with early COVID-19 cases, and captures multilingual Twitter interactions related to the pandemic. The dataset, hosted in the COVID-19-TweetIDs GitHub repository, is intended as a resource for the research community, supporting the study of social media discourse patterns during a global health crisis.
Key Contributions
The primary contribution of this research is an extensive, regularly updated dataset of COVID-19-related tweets, collected through Twitter's Streaming API with Tweepy for keyword and account tracking. As of May 2020, the dataset comprises over 123 million tweets. Most are in English (65.55%), and the remaining 34.45% span 67 other languages. The dataset allows researchers to investigate social dynamics, the spread of misinformation, and public sentiment during the pandemic.
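A language breakdown like the one above can be computed from tweet records in a few lines. The following is a minimal sketch, assuming each record is a dict with a `lang` field holding a language code (with "und" for undetermined); the function name and the toy records are illustrative and are not taken from the dataset or its tooling.

```python
from collections import Counter


def language_shares(tweets):
    """Return {language_code: percentage of tweets}, most frequent first."""
    # Records without a "lang" field are counted as "und" (an assumption).
    counts = Counter(t.get("lang", "und") for t in tweets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total, 2) for lang, n in counts.most_common()}


# Toy records for illustration only, not drawn from the dataset.
sample = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "en"},
    {"id": 3, "lang": "es"},
    {"id": 4, "lang": "it"},
]
shares = language_shares(sample)
print(shares)
```

On real data the same function would be applied to the full tweet objects recovered from the published tweet IDs, so the percentages depend on which tweets are still retrievable at that time.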
Methodological Insights
The data collection strategy combines real-time streaming with historical data querying, with coverage beginning on January 21, 2020. Twitter's public APIs allow systematic keyword and account tracking, and the tracked terms and accounts are updated as trends evolve. The approach prioritizes English terms and accounts but accommodates multilingual input, which introduces a bias toward English-speaking regions. Within that constraint, the dataset gives researchers a broad record of social media interactions reflecting global engagement with the pandemic.
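The idea of keyword and account tracking can be illustrated offline. The sketch below is a simplified local approximation, not the authors' collection code and not a reproduction of Twitter's matching rules: it keeps a tweet when its author is on a tracked-account list or its text contains a tracked keyword. The `KEYWORDS` and `ACCOUNTS` sets, the `matches` helper, and the record layout are all assumptions made for illustration.

```python
import re

# Hypothetical tracking lists for illustration; not the dataset's actual lists.
KEYWORDS = {"coronavirus", "covid19", "wuhan"}
ACCOUNTS = {"who", "cdcgov"}

# Split text into lowercase alphanumeric tokens, so "#Coronavirus" -> "coronavirus".
TOKEN = re.compile(r"[a-z0-9_]+")


def matches(tweet):
    """True if the tweet is from a tracked account or mentions a tracked keyword."""
    if tweet["user"].lower() in ACCOUNTS:
        return True
    tokens = set(TOKEN.findall(tweet["text"].lower()))
    return bool(tokens & KEYWORDS)


# Toy stream of records for illustration only.
stream = [
    {"user": "WHO", "text": "Daily briefing"},
    {"user": "alice", "text": "New #Coronavirus cases reported"},
    {"user": "bob", "text": "Nice weather today"},
]
kept = [t for t in stream if matches(t)]
print([t["user"] for t in kept])
```

Because the keyword list here is English-only, a tweet discussing the pandemic in another language without any of these terms would be dropped, which is the kind of bias toward English-speaking regions described above.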
Observations and Analysis
An initial analysis relates Twitter discourse metrics to significant events in the pandemic's timeline, cross-referenced against media sources such as Business Insider and CNN. Spikes in hashtag frequency align with pivotal COVID-19 developments, including WHO declarations and emerging news of deaths. Notably, hashtags containing "coronavirus" dominated early discourse, and later shifts in hashtag usage track the pandemic's progression and the public response to it.
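A hashtag-frequency analysis of this kind can be sketched as a per-day count followed by a simple spike check. The example below is a minimal illustration under stated assumptions: each record carries an ISO `date` string and a list of `hashtags`, and a "spike" is defined, arbitrarily, as a day whose count is at least twice the previous day's. Neither the record layout nor the threshold comes from the paper.

```python
from collections import Counter, defaultdict


def daily_hashtag_counts(tweets):
    """Map each date to a Counter of lowercased hashtags seen that day."""
    counts = defaultdict(Counter)
    for t in tweets:
        for tag in t["hashtags"]:
            counts[t["date"]][tag.lower()] += 1
    return counts


def spike_days(series, factor=2.0):
    """Days whose count is at least `factor` times the previous day's count."""
    days = sorted(series)  # ISO date strings sort chronologically
    return [
        day
        for prev, day in zip(days, days[1:])
        if series[prev] > 0 and series[day] >= factor * series[prev]
    ]


# Toy records for illustration only, not drawn from the dataset.
tweets = [
    {"date": "2020-01-29", "hashtags": ["coronavirus"]},
    {"date": "2020-01-30", "hashtags": ["Coronavirus", "WHO"]},
    {"date": "2020-01-30", "hashtags": ["coronavirus"]},
    {"date": "2020-01-30", "hashtags": ["CORONAVIRUS"]},
    {"date": "2020-01-31", "hashtags": ["coronavirus"]},
    {"date": "2020-01-31", "hashtags": ["coronavirus"]},
]
counts = daily_hashtag_counts(tweets)
series = {day: c["coronavirus"] for day, c in counts.items()}
spikes = spike_days(series)
print(series, spikes)
```

Flagged days would then be compared by hand against a news timeline; a fixed day-over-day ratio is only one possible definition of a spike and is sensitive to low-volume days.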
The dataset also shows language-specific tweet activity that tracks significant events in Italy, Japan, and Spain, illustrating the global reach of the crisis on social media. Additionally, heightened activity from verified accounts around major events underscores Twitter's role as a platform for disseminating authoritative information.
Limitations
Because collection relies on Twitter's streaming API, data capture is limited to roughly 1% of total Twitter volume, which restricts the dataset's comprehensiveness. Furthermore, the predominance of English keywords potentially skews the dataset, although the presence of numerous non-English tweets partially mitigates this issue.
Implications and Future Directions
This dataset is a valuable resource for probing the social implications of the COVID-19 pandemic, offering a basis for studying misinformation, communication patterns, and emotional responses. As the pandemic unfolds, continued data collection will enable robust, longitudinal analyses. Researchers should also anticipate that the dataset's scope will expand as emerging keywords and accounts are integrated to capture the discourse more comprehensively, fostering deeper insights into global responses to health crises.
Conclusion
The development and dissemination of this multilingual COVID-19 Twitter dataset constitute a significant academic asset. Its granularity gives researchers a tool for exploring the intersection of social media and public health during the pandemic, with the potential to advance understanding across numerous disciplines. Ongoing maintenance and growth of the dataset will be critical to meeting the evolving research demands raised by the pandemic and by similar future events.