CHECKED: Chinese COVID-19 Fake News Dataset

Published 18 Oct 2020 in cs.SI and cs.IR | (2010.09029v2)

Abstract: COVID-19 has impacted all lives. To maintain social distancing and avoid exposure, work and daily life have gradually moved online. Under this trend, social media usage to obtain COVID-19 news has increased. Misinformation on COVID-19 is also frequently spread on social media. In this work, we develop CHECKED, the first Chinese dataset on COVID-19 misinformation. CHECKED provides a total of 2,104 verified microblogs related to COVID-19 from December 2019 to August 2020, identified by using a specific list of keywords. Correspondingly, CHECKED includes 1,868,175 reposts, 1,185,702 comments, and 56,852,736 likes that reveal how these verified microblogs are spread and reacted to on Weibo. The dataset contains a rich set of multimedia information for each microblog, including ground-truth label, textual, visual, temporal, and network information. Extensive experiments have been conducted to analyze CHECKED data and to provide benchmark results for well-established methods when predicting fake news using CHECKED. We hope that CHECKED can facilitate studies that target misinformation on coronavirus. The dataset is available at https://github.com/cyang03/CHECKED.

Summary

The paper introduces the first detailed Chinese COVID-19 fake news dataset from Weibo that combines textual, visual, and network data for misinformation research.
It includes 2,104 verified microblogs together with their engagement data (1,868,175 reposts, 1,185,702 comments, and 56,852,736 likes), enabling analysis of how misinformation spreads and how users react to it.
Benchmark experiments show TextCNN achieving a macro F1 score of 0.938, illustrating its effectiveness in detecting COVID-19 misinformation.

CHECKED: A Dataset for Investigating Chinese COVID-19 Misinformation

The paper "CHECKED: Chinese COVID-19 Fake News Dataset" introduces a dataset curated for analyzing how COVID-19 misinformation spread on Weibo, a major social media platform in China. The authors present it as the first dataset to focus on COVID-19 misinformation in Chinese, filling a gap in the data resources available for misinformation research.

Overview of the CHECKED Dataset

The CHECKED dataset comprises 2,104 verified microblogs from Weibo, spanning the period from December 2019 to August 2020. The dataset includes detailed information such as textual content, multimedia elements (images and videos), temporal data, and network interactions like comments, reposts, and likes. These microblogs have been fact-checked and labeled as "real" or "fake," providing a ground truth for researchers studying misinformation.

The dataset contains 344 microblogs labeled fake and 1,760 labeled real, so roughly 16% of the microblogs are fake. It also aggregates user activity around them: 1,868,175 reposts, 1,185,702 comments, and 56,852,736 likes. These counts make it possible to examine how misinformation circulates on the platform and how much interaction it attracts.
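The counts above are enough to work out the class balance and the average engagement per microblog. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative and are not taken from the dataset's files. Note that the averages are computed over all 2,104 microblogs, since the text does not break engagement down by label.

```python
# Counts exactly as stated in the text; names are illustrative only.
FAKE = 344
REAL = 1_760
TOTAL = FAKE + REAL

ENGAGEMENT = {
    "reposts": 1_868_175,
    "comments": 1_185_702,
    "likes": 56_852_736,
}


def class_shares(fake: int, real: int) -> dict:
    """Return the fraction of microblogs in each class."""
    total = fake + real
    return {"fake": fake / total, "real": real / total}


def mean_per_microblog(totals: dict, n_microblogs: int) -> dict:
    """Return the average count of each interaction type per microblog."""
    return {name: count / n_microblogs for name, count in totals.items()}


shares = class_shares(FAKE, REAL)
means = mean_per_microblog(ENGAGEMENT, TOTAL)

print(f"total microblogs: {TOTAL}")
print(f"fake share: {shares['fake']:.1%}, real share: {shares['real']:.1%}")
for name, value in means.items():
    print(f"mean {name} per microblog: {value:,.1f}")
```

Averages like these hide how engagement is distributed; social media interaction counts are typically concentrated in a small number of posts, so the per-microblog records in the dataset would be needed for anything beyond a rough sense of scale.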

Key Contributions and Methodology

The work makes three main contributions:

Dataset Creation: CHECKED is the first Chinese dataset that combines COVID-19-related microblogs with misinformation labels, offering a foundation for studying the spread and characteristics of fake news in the Chinese language and cultural context.
Feature-Rich Data: The dataset is enriched with features that encompass textual, visual, network, and temporal dimensions, allowing a multifaceted analysis of the content and its dissemination across the platform.
Benchmarking and Experiments: The authors provide benchmark results using well-established text classification methods, including FastText, TextCNN, and Transformer models. TextCNN, in particular, achieves the highest macro F1 score of 0.938, highlighting its efficacy in this context.
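The benchmark is reported as macro F1, which the text does not define. Under the standard definition, macro F1 is the unweighted mean of the per-class F1 scores, so the minority "fake" class counts as much as the majority "real" class. The sketch below implements that definition in plain Python and applies it to a hypothetical classifier that labels every microblog "real", using the full-dataset class counts purely for illustration. It is not the authors' evaluation code, and it ignores whatever train/test split they used.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical majority-class baseline on the stated class counts.
y_true = ["real"] * 1_760 + ["fake"] * 344
y_pred = ["real"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_macro_f1 = macro_f1(y_true, y_pred)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"macro F1: {baseline_macro_f1:.3f}")
```

Because the baseline never predicts "fake", its F1 for that class is zero and its macro F1 falls well below its accuracy. This is why macro F1 is a more informative metric than accuracy for a dataset with this class imbalance.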

Implications and Future Directions

The CHECKED dataset supports the study of misinformation dynamics specific to Chinese social media, which matters given the global reach of the COVID-19 pandemic and the cultural and linguistic specificity of misinformation. From a theoretical perspective, the dataset enables investigation of the properties of misinformation in a non-Western context, which could lead to more general theories of how misinformation spreads.

Practically, the dataset can inform the development of more effective detection systems tailored to Chinese language and social media use. Moreover, it underscores the importance of localized misinformation research and detection tools, which can be integrated with global efforts to combat misinformation.

Future work could involve expanding the dataset to encompass additional sources of misinformation, integrating cross-linguistic analyses with datasets from other languages, or exploring the integration of multimodal data (e.g., video and text) for enhanced fake news detection.

In summary, the CHECKED dataset stands as a crucial resource for advancing research in misinformation dynamics, particularly within the context of a global health crisis, and emphasizes the need for culturally and linguistically nuanced approaches to misinformation detection and analysis.
