A Thorough Examination of the "FakeCovid" Dataset for Multilingual Cross-Domain Fact-Checking in COVID-19 News
The paper presents a significant contribution to misinformation detection during the COVID-19 pandemic through the creation of the "FakeCovid" dataset. This work addresses the challenge of the "infodemic," the World Health Organization's term for the flood of misinformation that spread alongside the virus itself. The dataset compiles 5,182 fact-checked news articles related to COVID-19, spanning 40 languages and 105 countries, thus offering a comprehensive multilingual and cross-domain resource for analyzing false news dissemination.
Methodology and Dataset Creation
The authors collected articles from 92 distinct fact-checking websites, leveraging resources like Poynter and Snopes. Each article was annotated into one of 11 categories reflecting the content and nature of the misinformation, ranging from conspiracy theories to international responses concerning COVID-19. English is the most common language, accounting for 40.8% of articles, with the remainder distributed across 39 other languages, which enables multilingual analysis.
Data collection and preprocessing involved standard techniques: scraping content with Python libraries and handling issues such as malformed URLs and duplicate entries. Articles were then manually categorized by annotators with relevant linguistic expertise, yielding high intercoder reliability.
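The cleaning steps described above can be illustrated with a short sketch. The function names and record layout here are hypothetical, not the authors' actual pipeline; in practice the article text would first be fetched with a scraping library such as requests/BeautifulSoup.

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Basic sanity check to filter out malformed fact-check URLs."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def deduplicate(records):
    """Drop entries whose normalized URL has already been seen.

    Each record is assumed to be a dict with at least a 'url' key.
    """
    seen, unique = set(), []
    for rec in records:
        key = rec["url"].rstrip("/").lower()  # normalize trailing slash and case
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Normalizing before comparison matters here: the same fact-check is often linked both with and without a trailing slash, so a naive exact-match dedup would leave duplicates behind.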
Automatic Fake News Detection
The dataset was utilized to build and evaluate a machine learning-based classifier aimed at detecting false news. This classifier achieved an F1-score of 0.76 for distinguishing articles labeled as false, demonstrating its practical utility for filtering misinformation. The binary classification approach, which differentiates between 'false' and 'other' categories, streamlines early screening, a sensible framing given that articles labeled 'false' dominate the dataset.
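The binary framing and the reported metric can be sketched as follows. This is a minimal illustration, not the paper's actual model: the label-collapsing rule and function names are assumptions, and the F1 definition shown is the standard one for the positive ('false') class.

```python
def to_binary(label):
    """Collapse the 11 fact-check categories into 'false' vs 'other'."""
    return "false" if label.strip().lower() == "false" else "other"

def f1_score(y_true, y_pred, positive="false"):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Reporting F1 on the 'false' class rather than raw accuracy is the right choice here: with a class-imbalanced dataset, a classifier that predicts 'false' for everything would score high accuracy while being useless for screening.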
Implications and Future Research Directions
The FakeCovid dataset and corresponding classifier model offer several implications for combating misinformation. Practically, the dataset can inform the development of more sophisticated automated tools to aid fact-checkers, reducing the resource-intensive nature of manual verification. Theoretically, the dataset provides a foundation for studying the dynamics of misinformation spread, as well as the potential to refine classifiers for improved accuracy and language coverage.
Future research could delve into the propagation pathways of false news across various social media platforms, potentially leveraging linked data from platforms like Twitter. Additionally, augmenting the dataset with metadata such as schema markup, or constructing knowledge graphs from it, could improve misinformation detection and support downstream tasks such as machine comprehension and question answering.
Conclusion
This research fills a crucial gap in resources available for misinformation detection during the COVID-19 pandemic. By providing a dataset that is both multilingual and sourced from a broad array of fact-checkers, it lays the groundwork for future developments in automated fact-checking technologies, helping to address the challenges posed by rapidly evolving and diverse misinformation landscapes. The open availability of the dataset facilitates ongoing exploration and improvement in the domain of misinformation detection, ultimately contributing to more informed public discourse in times of crisis.