
ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research (2006.05557v2)

Published 9 Jun 2020 in cs.SI and cs.IR

Abstract: First identified in Wuhan, China, in December 2019, the outbreak of COVID-19 has been declared as a global emergency in January, and a pandemic in March 2020 by the World Health Organization (WHO). Along with this pandemic, we are also experiencing an "infodemic" of information with low credibility such as fake news and conspiracies. In this work, we present ReCOVery, a repository designed and constructed to facilitate research on combating such information regarding COVID-19. We first broadly search and investigate ~2,000 news publishers, from which 60 are identified with extreme [high or low] levels of credibility. By inheriting the credibility of the media on which they were published, a total of 2,029 news articles on coronavirus, published from January to May 2020, are collected in the repository, along with 140,820 tweets that reveal how these news articles have spread on the Twitter social network. The repository provides multimodal information of news articles on coronavirus, including textual, visual, temporal, and network information. The way that news credibility is obtained allows a trade-off between dataset scalability and label accuracy. Extensive experiments are conducted to present data statistics and distributions, as well as to provide baseline performances for predicting news credibility so that future methods can be compared. Our repository is available at http://coronavirus-fakenews.com.

An Analysis of the ReCOVery Repository for COVID-19 News Credibility Research

The presented research introduces the ReCOVery repository, a comprehensive multimodal dataset built to support the assessment of news credibility concerning COVID-19. It responds to the "infodemic" of low-credibility information that has accompanied the health crisis. By compiling 2,029 news articles and 140,820 related tweets, the repository gives researchers textual, visual, temporal, and network information for studying how COVID-19 news propagates on social media.

Methodology and Data Collection

The repository aggregates data from approximately 2,000 news publishers, from which sixty were selected based on credibility standards defined by NewsGuard and Media Bias/Fact Check. Only publishers with extreme credibility ratings were retained: those exceeding a score of 90 were deemed highly reliable, while those below 30 were marked as unreliable, and news articles inherit the label of the publisher that released them. This classification method provides a strategic trade-off between dataset scalability and labeling accuracy, thus facilitating more extensive research applications.
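
As a rough illustration, the labeling rule can be expressed as a small function. The thresholds (90 and 30) follow the description above; the function name and the explicit "excluded" outcome are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the publisher-level credibility labeling described above.
# Thresholds come from the paper's selection criteria; everything else is illustrative.

def label_publisher(newsguard_score: float) -> str:
    """Map a NewsGuard score to a credibility label, keeping only the extremes."""
    if newsguard_score > 90:
        return "reliable"
    if newsguard_score < 30:
        return "unreliable"
    return "excluded"  # mid-range publishers are left out of the repository

# Articles inherit the label of the publisher that released them.
print(label_publisher(95))  # reliable
print(label_publisher(12))  # unreliable
```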

Repository Features

Each news article in the repository is annotated with several attributes crucial for credibility analysis:

  • Textual and Visual Content: Articles are sourced with both their text and main image, enabling multi-modal analysis.
  • Metadata and Social Data: Publications are timestamped and geo-tagged, while political biases of the publishers and the spread of these articles across Twitter are rigorously documented. This repository, therefore, forms a basis for examining the temporal and spatial dynamics of COVID-19 misinformation.
  • Social Media Propagation: A comprehensive list of tweets sharing these articles helps characterize how information dissemination patterns can affect public perception during pandemics (see the loading sketch after this list).
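
To show how these fields might be consumed together, the following sketch loads a hypothetical CSV export of the news table and prints one record. The file name and column names are assumptions for illustration, not the repository's documented schema; consult http://coronavirus-fakenews.com for the actual layout.

```python
# Minimal sketch of inspecting one multimodal record from the repository.
# File name and column names are hypothetical placeholders.
import pandas as pd

articles = pd.read_csv("recovery_news_data.csv")  # hypothetical file name

# Each row is expected to bundle the modalities listed above:
# body text, main image, publication time, publisher bias, and a credibility label.
example = articles.iloc[0]
print(example.get("title"))         # headline text
print(example.get("body_text"))     # full article text
print(example.get("image_url"))     # main image for visual analysis
print(example.get("publish_date"))  # temporal information
print(example.get("reliability"))   # inherited credibility label (e.g., 1 = reliable)
```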

Baseline Experiments

This paper endeavors to benchmark predictive models for news credibility using traditional and advanced methodologies, including Decision Trees with LIWC features, Rhetorical Structure Theory (RST) analysis, Text-CNN, and SAFE — a similarity-aware model leveraging text and image data. Results indicate that multi-modal approaches, particularly those incorporating neural architectures like SAFE, outperform single-modal models significantly in credibility prediction tasks.
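
As a stand-in for the simplest baseline family above (a decision tree over text features), the sketch below trains a tree on TF-IDF features. The paper's version uses LIWC features; TF-IDF is substituted here purely to keep the example self-contained, and the toy texts and labels are made up for illustration.

```python
# Hedged stand-in for the decision-tree text baseline (LIWC swapped for TF-IDF).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data: article bodies and labels (1 = reliable, 0 = unreliable, as in the repository).
texts = [
    "health officials confirm updated distancing guidance",
    "new peer-reviewed study reports vaccine trial results",
    "miracle cure suppressed by a secret global elite",
    "5g towers proven to spread the virus overnight",
]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

baseline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    DecisionTreeClassifier(random_state=0),
)
baseline.fit(X_train, y_train)
print("F1:", f1_score(y_test, baseline.predict(X_test)))
```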

Implications and Further Research

The ReCOVery repository stands at a pivotal intersection of computational linguistics, social network analysis, and misinformation research, offering actionable insights into the dynamics of fake news propagation and its real-world implications. The availability of multimodal data strengthens the prospect of understanding misinformation through cross-disciplinary lenses.

Future developments in AI that leverage datasets like ReCOVery are anticipated to refine misinformation detection, allowing more proactive responses to infodemics. Continued adaptation of machine learning techniques, combined with multi-language dataset expansion, could strengthen the robustness of misinformation detection across different media landscapes and mitigate potential socio-political ramifications.

In summary, ReCOVery offers a structured and scalable approach for combating misinformation with rigorous, data-driven methodologies. As scholarly efforts persist in addressing the global challenge of misinformation, such repositories provide a critical foundation for enhancing the reliability of public information in the face of future crises.

Authors (4)
  1. Xinyi Zhou (33 papers)
  2. Apurva Mulay (1 paper)
  3. Emilio Ferrara (197 papers)
  4. Reza Zafarani (18 papers)
Citations (170)