
Some Like it Hoax: Automated Fake News Detection in Social Networks (1704.07506v1)

Published 25 Apr 2017 in cs.LG, cs.HC, and cs.SI

Abstract: In recent years, the reliability of information on the Internet has emerged as a crucial issue of modern society. Social network sites (SNSs) have revolutionized the way in which information is spread by allowing users to freely share content. As a consequence, SNSs are also increasingly used as vectors for the diffusion of misinformation and hoaxes. The amount of disseminated information and the rapidity of its diffusion make it practically impossible to assess reliability in a timely manner, highlighting the need for automatic hoax detection systems. As a contribution towards this objective, we show that Facebook posts can be classified with high accuracy as hoaxes or non-hoaxes on the basis of the users who "liked" them. We present two classification techniques, one based on logistic regression, the other on a novel adaptation of boolean crowdsourcing algorithms. On a dataset consisting of 15,500 Facebook posts and 909,236 users, we obtain classification accuracies exceeding 99% even when the training set contains less than 1% of the posts. We further show that our techniques are robust: they work even when we restrict our attention to the users who like both hoax and non-hoax posts. These results suggest that mapping the diffusion pattern of information can be a useful component of automatic hoax detection systems.

Citations (383)

Summary

  • The paper demonstrates a novel approach to automated fake news detection by classifying social network posts based solely on user interaction data (likes).
  • A harmonic boolean crowdsourcing algorithm achieved over 99% accuracy when trained on as little as 0.5% of the posts, and logistic regression also reached accuracy above 90%.
  • Achieving high accuracy with limited training data suggests potential for scalable implementation on large social networks and transferability across different user communities.

Overview of "Some Like it Hoax: Automated Fake News Detection in Social Networks"

The paper "Some Like it Hoax: Automated Fake News Detection in Social Networks" explores the development of a framework capable of detecting hoaxes on Social Network Sites (SNSs) using user interaction data, specifically likes on Facebook. In addressing the growing proliferation of misinformation, this paper provides an innovative approach to classifying posts as hoaxes or non-hoaxes based on the users who engage with them. The authors demonstrate how applying logistic regression and a novel adaptation of boolean crowdsourcing techniques can achieve remarkably high classification accuracy, even with minimal training data.

Classification Techniques and Results

Two primary classification techniques are employed: logistic regression and a novel boolean crowdsourcing algorithm. The logistic regression model treats individual users as features, so a post is described by the set of users who liked it. The model learns one weight per user, and the sign and size of that weight indicate the user's propensity for liking hoax or non-hoax posts.
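To make the "users as features" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy posts, the user ids, the label convention (1 = hoax), and the hyperparameters `epochs`, `lr`, and `l2` are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy data: each post is the set of user ids who liked it.
likes = [
    {"u1", "u2"},  # hoax
    {"u1", "u3"},  # hoax
    {"u4", "u5"},  # non-hoax
    {"u5", "u6"},  # non-hoax
]
labels = [1, 1, 0, 0]  # assumed convention: 1 = hoax, 0 = non-hoax


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(likes, labels, epochs=500, lr=0.5, l2=0.01):
    """Learn one weight per user with stochastic gradient descent on the log-loss."""
    weights = {user: 0.0 for post in likes for user in post}
    bias = 0.0
    for _ in range(epochs):
        for post, y in zip(likes, labels):
            p = sigmoid(bias + sum(weights[user] for user in post))
            err = p - y  # gradient of the log-loss with respect to the score
            bias -= lr * err
            for user in post:
                # Only users who liked this post are updated (sparse features).
                weights[user] -= lr * (err + l2 * weights[user])
    return weights, bias


def predict_proba(post, weights, bias):
    """Probability that a post is a hoax; unseen users contribute nothing."""
    return sigmoid(bias + sum(weights.get(user, 0.0) for user in post))


weights, bias = train(likes, labels)
print(predict_proba({"u1", "u7"}, weights, bias))
print(predict_proba({"u5"}, weights, bias))
```

A positive weight marks a user who mostly liked hoax posts in the training data, and a negative weight marks the opposite. In practice a sparse-matrix library would replace the dictionaries, but the structure of the model is the same.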

The boolean crowdsourcing algorithm adapts traditional crowdsourcing mechanisms by discarding the assumption of user reliability. Instead, it uses a training set of posts with known labels (hoax or non-hoax) as its starting point, and it infers the nature of unseen posts through iterative updates across the bipartite graph of users and the posts they liked.
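The iterative update over the bipartite like graph can be sketched as follows. This is a simplified propagation scheme written for illustration, not the paper's exact harmonic algorithm: the sign convention (+1 = non-hoax, -1 = hoax), the `prior` pseudo-count, the iteration count, and the toy graph are assumptions.

```python
# Hypothetical toy graph: post id -> set of users who liked it.
likes = {
    "p1": {"u1", "u2"},
    "p2": {"u2", "u3"},
    "p3": {"u4", "u5"},
    "p4": {"u5", "u6"},
    "p5": {"u1", "u3"},  # unlabeled
    "p6": {"u4", "u6"},  # unlabeled
}
# Training labels (assumed convention): +1 = non-hoax, -1 = hoax.
known = {"p1": -1.0, "p3": 1.0}


def score(values, prior):
    """Combine signed evidence into a score in (-1, 1)."""
    pos = prior + sum(v for v in values if v > 0)
    neg = prior + sum(-v for v in values if v < 0)
    return (pos - neg) / (pos + neg)


def propagate(likes, known, iterations=5, prior=1.0):
    users = {u for liked_by in likes.values() for u in liked_by}
    post_q = {p: known.get(p, 0.0) for p in likes}
    user_q = {u: 0.0 for u in users}
    for _ in range(iterations):
        # Step 1: each user is scored from the posts they liked.
        for u in users:
            user_q[u] = score(
                [post_q[p] for p, liked_by in likes.items() if u in liked_by], prior
            )
        # Step 2: each unlabeled post is scored from the users who liked it.
        for p, liked_by in likes.items():
            if p not in known:  # training labels stay fixed
                post_q[p] = score([user_q[u] for u in liked_by], prior)
    return post_q, user_q


post_q, user_q = propagate(likes, known)
for post_id in sorted(post_q):
    verdict = "hoax" if post_q[post_id] < 0 else "non-hoax"
    print(post_id, round(post_q[post_id], 3), verdict)
```

Note that no user is assumed to be reliable in advance: a user's score is derived entirely from the labeled posts reachable through the graph, and it can be negative.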

The paper reports strong numerical results. The harmonic version of the boolean crowdsourcing algorithm attains classification accuracies exceeding 99% when trained on just 0.5% of the dataset (approximately 80 posts). Logistic regression underperforms the harmonic algorithm when dataset overlap is limited, but it still achieves over 90% accuracy, which supports its suitability for this application.
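As a quick check of the training-set size quoted above, the following sketch splits 15,500 post ids so that 0.5% are used for training. The helper name `split_by_fraction`, the fixed seed, and the choice to round the training size up are assumptions made for this example.

```python
import math
import random


def split_by_fraction(post_ids, train_fraction, seed=0):
    """Shuffle post ids and reserve a small fraction of them for training."""
    rng = random.Random(seed)
    shuffled = list(post_ids)
    rng.shuffle(shuffled)
    # Round up so that a tiny fraction never yields an empty training set.
    n_train = math.ceil(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


train_ids, test_ids = split_by_fraction(range(15_500), 0.005)
print(len(train_ids), len(test_ids))
```

With these numbers the training set holds 78 posts, which matches the "approximately 80 posts" figure, and the remaining posts are available for evaluation.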

Dataset and Methodology

The paper's dataset is derived from Facebook posts published during the latter half of 2016, encompassing interactions from over 900,000 users across 15,500 posts. Posts are labeled by the type of page that published them: posts from scientific pages are assumed to be non-hoaxes, and posts from conspiracy-themed pages are assumed to be hoaxes. Despite potential biases in this labeling method, the paper makes a strong case that new posts can be classified effectively from the interactions of their audience.
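The page-based labeling rule can be sketched as below. The page names, the record layout `(page, post_id, user_id)`, and the helper `build_dataset` are hypothetical; the only element taken from the text is the rule that a post inherits its label from the category of the page that published it.

```python
from collections import defaultdict

# Hypothetical page categories; real page names are not given here.
page_category = {
    "science_page": "scientific",
    "conspiracy_page": "conspiracy",
}

# Hypothetical raw like records: (page, post_id, user_id).
raw_likes = [
    ("science_page", "s1", "alice"),
    ("science_page", "s1", "bob"),
    ("conspiracy_page", "c1", "carol"),
    ("conspiracy_page", "c1", "bob"),
    ("conspiracy_page", "c2", "carol"),
]


def build_dataset(raw_likes, page_category):
    """Group likes by post and label each post by its page's category."""
    post_likes = defaultdict(set)
    post_label = {}
    for page, post_id, user_id in raw_likes:
        post_likes[post_id].add(user_id)
        # Assumed convention: 1 = hoax (conspiracy page), 0 = non-hoax.
        post_label[post_id] = 1 if page_category[page] == "conspiracy" else 0
    return dict(post_likes), post_label


post_likes, post_label = build_dataset(raw_likes, page_category)
print(post_label)
print(post_likes)
```

Because the label comes from the page rather than from the content of each post, any mislabeled post on a page carries over into the dataset, which is the bias mentioned above.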

The authors evaluate two datasets: the complete dataset and an intersection dataset restricted to users who liked both hoax and non-hoax posts. This comparison tests whether the methods remain robust beyond polarized user communities. The results indicate that logistic regression is the stronger performer on the intersection dataset, suggesting that liking behavior can still support reliable classification among users without a strong bias toward one type of content.
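One way to construct such an intersection dataset is sketched below. The function name, the toy data, and the decision to drop posts that lose all of their likes after filtering are assumptions made for this example, not details confirmed by the text.

```python
def intersection_dataset(post_likes, post_label):
    """Keep only users who liked at least one hoax and one non-hoax post."""
    hoax_users = set()
    nonhoax_users = set()
    for post_id, users in post_likes.items():
        if post_label[post_id] == 1:  # assumed convention: 1 = hoax
            hoax_users |= users
        else:
            nonhoax_users |= users
    mixed_users = hoax_users & nonhoax_users

    filtered = {}
    for post_id, users in post_likes.items():
        kept = users & mixed_users
        if kept:  # assumption: posts with no remaining likes are dropped
            filtered[post_id] = kept
    return filtered, mixed_users


# Hypothetical toy data.
post_likes = {
    "h1": {"a", "b"},
    "h2": {"b", "c"},
    "s1": {"b", "d"},
    "s2": {"d", "e"},
}
post_label = {"h1": 1, "h2": 1, "s1": 0, "s2": 0}

filtered, mixed_users = intersection_dataset(post_likes, post_label)
print(mixed_users)
print(filtered)
```

In this toy example only user `b` liked posts of both kinds, so every remaining post is described by that single user. This illustrates why the intersection dataset is a harder test: the polarized users who make classification easy have been removed.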

Implications and Future Directions

The successful use of user interaction data for hoax classification has substantial implications for automatic fake news detection systems on SNSs. Because accurate classification requires only a small training set, the approach could scale to large social networks with a manageable amount of manual labeling. Additionally, the transferability of learning across different user communities signals opportunities for algorithmic refinement and for application in contexts beyond the specific dataset studied in the paper.

Future work could test how well these techniques adapt to other social platforms, enriching the toolkit available for combating misinformation. Continued research in this direction may also inspire similar approaches to other content moderation challenges, and help align technical solutions with the practical and ethical considerations of digital ecosystems.