
r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection (1911.03854v2)

Published 10 Nov 2019 in cs.CL, cs.CY, and cs.IR

Abstract: Fake news has altered society in negative ways in politics and culture. It has adversely affected both online social network systems as well as offline communities and conversations. Using automatic machine learning classification models is an efficient way to combat the widespread dissemination of fake news. However, a lack of effective, comprehensive datasets has been a problem for fake news research and detection model development. Prior fake news datasets do not provide multimodal text and image data, metadata, comment data, and fine-grained fake news categorization at the scale and breadth of our dataset. We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision. We construct hybrid text+image models and perform extensive experiments for multiple variations of classification, demonstrating the importance of the novel aspect of multimodality and fine-grained classification unique to Fakeddit.

Authors (3)
  1. Kai Nakamura (5 papers)
  2. Sharon Levy (22 papers)
  3. William Yang Wang (254 papers)
Citations (111)

Summary

Overview of the r/Fakeddit Dataset Paper

The prevalence of fake news across digital platforms necessitates robust detection mechanisms to mitigate its societal impact. The paper "r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection" addresses a significant gap in the current fake news detection landscape, primarily the lack of multimodal datasets that encompass both text and image data. Traditional datasets have largely been limited to textual content, which constrains their applicability in real-world scenarios where fake news is often disseminated through multimedia.

Key Contributions

  1. Introduction of the Fakeddit Dataset: The paper presents Fakeddit, a multimodal dataset of over 1 million samples sourced from Reddit. The dataset includes text, images, metadata, and comment data, offering a comprehensive foundation for both coarse and fine-grained fake news classification across 2-way, 3-way, and 6-way label schemes.
  2. Multimodal Approach: A significant contribution is the focus on multimodality by combining text and image features. This approach is critical in accurately detecting fake news that relies on visual deception alongside misleading text.
  3. Hierarchical Labeling: The dataset supports hierarchical classification, with labels assigned at three granularities: 2-way (true vs. fake), 3-way (true, fake with true text, and fake with false text), and 6-way, which distinguishes fine-grained categories of fake news such as satire, manipulated content, and imposter content. This stratification enables nuanced fake news analysis and model training.
  4. Baseline Models and Experimental Validation: The authors implemented hybrid text+image neural network models and conducted extensive experiments. The results indicate substantial improvements in detection accuracy when using multimodal data, with BERT and ResNet50 models yielding the best performance across classification tasks. The "maximum" combination method was found to be most effective in merging text and image features.
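The "maximum" fusion method named above can be sketched as an element-wise max over the two modality feature vectors before a classification head. The sketch below is illustrative only: random vectors stand in for the BERT text embedding and the (projected) ResNet50 image feature, and the shared dimension `DIM` and the toy linear head are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 768  # assumed shared feature dimension after projection

# Stand-ins for the real encoders: in the paper, text features come from
# BERT and image features from ResNet50; here random vectors suffice.
text_feat = rng.standard_normal(DIM)
image_feat = rng.standard_normal(DIM)

# "Maximum" fusion: element-wise max across the two modality vectors,
# reported in the paper as the best-performing combination method.
fused = np.maximum(text_feat, image_feat)

# Toy linear head for the 6-way fine-grained task (weights illustrative).
W = rng.standard_normal((6, DIM)) * 0.01
logits = W @ fused
pred = int(np.argmax(logits))
```

Compared with concatenation, element-wise max keeps the fused vector the same size as each modality's features, letting whichever modality responds more strongly dominate each dimension.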

Numerical Results

The experimental section provides quantitative insights demonstrating the efficacy of multimodal features. Specifically, the combination of BERT text features and ResNet50 image features achieved a test accuracy of 89.09% for 2-way classification, 88.90% for 3-way classification, and 85.88% for 6-way classification—highlighting the importance of multimodal integration in enhancing detection robustness.

Implications and Future Work

Practical: The availability of the Fakeddit dataset can significantly advance the development and benchmarking of fake news detection algorithms, especially those leveraging multimodal data. It aligns well with real-world applications where fake news is distributed across various media forms.

Theoretical: The findings emphasize the need for further research in multimodal learning to capture the complexities of deceitful content. The introduction of hierarchical and fine-grained classification categories encourages more refined model training that can differentiate subtle differences in fake news types.

Speculative Future Directions: The paper paves the way for further exploration into incorporating additional data types such as user interaction patterns and content sharing networks. There is also potential for exploring methods to quantify and mitigate bias in multimodal learning systems for fake news detection. Researchers may also investigate integrating video analysis into multimodal systems, considering the increasing use of video content in social media misinformation.

In conclusion, the paper introduces Fakeddit as a pioneering dataset that addresses significant limitations of existing resources, offering a more holistic tool for fake news detection research. The emphasis on multimodal analysis and fine-grained classification is particularly relevant in evolving contexts where misinformation in various formats poses a persistent challenge.
