
Automatic Detection of Fake News (1708.07104v1)

Published 23 Aug 2017 in cs.CL

Abstract: The proliferation of misleading information in everyday access media outlets such as social media feeds, news blogs, and online newspapers has made it challenging to identify trustworthy news sources, thus increasing the need for computational tools able to provide insights into the reliability of online content. In this paper, we focus on the automatic identification of fake content in online news. Our contribution is twofold. First, we introduce two novel datasets for the task of fake news detection, covering seven different news domains. We describe the collection, annotation, and validation process in detail and present several exploratory analyses of the linguistic differences between fake and legitimate news content. Second, we conduct a set of learning experiments to build accurate fake news detectors. In addition, we provide comparative analyses of the automatic and manual identification of fake news.

Analyzing Automatic Detection of Fake News

The paper "Automatic Detection of Fake News" by Pérez-Rosas et al. addresses the significant challenge of identifying fake news in digital media. Given the increasing prevalence of misinformation on platforms such as social media and online news, this research explores computational methods for the automatic detection of fake content. The paper makes two primary contributions: the introduction of two newly curated datasets and the implementation of learning algorithms to develop accurate fake news detectors.

Introduction and Background

The authors underscore the relevance of the paper, noting the high traffic that fake news sites receive through social media referrals and the critical need for tools that can help discern reliable from misleading content. Prior approaches to fake news detection have often used data from satirical sources like "The Onion" or fact-checking websites such as "PolitiFact" and "Snopes." However, these sources are limited by confounding factors like humor and domain specificity.

Datasets and Methodology

The paper focuses on the construction of two novel datasets for varied and accurate analysis. One dataset, FakeNewsAMT, was collected from six domains (sports, business, entertainment, politics, technology, and education) through a combination of manual and crowdsourced annotation. The second dataset, Celebrity, was derived from web sources focusing on celebrity news due to its high susceptibility to rumors and fake reports.

  • FakeNewsAMT Dataset: Consists of legitimate news articles paired with fake versions written by Amazon Mechanical Turk workers, who were instructed to maintain the original journalistic style.
  • Celebrity Dataset: Collected from entertainment websites with rigorous cross-referencing to verify the authenticity of legitimate news and identify fake news without major overlaps.

Features and Experimental Setup

The paper employs multiple sets of linguistic features to build its detection models:

  • Ngrams: Unigrams and bigrams based on tf-idf values.
  • Punctuation: Usage metrics derived from LIWC.
  • Psycholinguistic Features: Proportions of words categorized by LIWC into cognitive processes, emotional tone, etc.
  • Readability Metrics: Metrics like Flesch-Kincaid, Gunning Fog, among others.
  • Syntax: Context-Free Grammar (CFG) derived features.
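
To make the readability features concrete, here is a minimal sketch of the Flesch-Kincaid grade-level formula, one of the metrics the paper draws on. The syllable counter below is a rough vowel-group heuristic for illustration, not necessarily the one used by the authors:

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, with a silent-'e' adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Scores like this one, computed per article, become scalar features alongside the n-gram and LIWC counts.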

The classifiers, developed using linear SVMs, were tested using five-fold cross-validation to ensure robustness.
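
The setup above can be sketched with scikit-learn. This is a minimal stand-in, not the authors' exact configuration: the toy texts and labels below are invented for illustration, with tf-idf unigrams/bigrams feeding a linear SVM evaluated by five-fold cross-validation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus; the real experiments use FakeNewsAMT and Celebrity.
texts = [
    "Local team wins championship after dramatic final match",
    "Stock markets rally as quarterly earnings beat expectations",
    "Aliens endorse mayoral candidate in shocking press conference",
    "Celebrity spotted riding invisible unicorn through downtown",
    "New study links regular exercise to improved heart health",
    "Government confirms moon is made of artisanal cheese",
    "University announces new scholarship program for students",
    "Singer secretly replaced by hologram for entire world tour",
    "City council approves budget for road repairs next year",
    "Time traveler warns public about next week's lottery numbers",
]
labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # 0 = legitimate, 1 = fake

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram + bigram tf-idf features
    LinearSVC(),
)
scores = cross_val_score(pipeline, texts, labels, cv=5)  # five-fold CV accuracies
```

Swapping the vectorizer for LIWC, readability, or CFG feature extractors reproduces the other feature configurations under the same evaluation protocol.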

Results

The experimentation shows promising results, with different feature sets obtaining high accuracy:

  • FakeNewsAMT: Best model used readability features, achieving an accuracy of up to 78%.
  • Celebrity Dataset: All features combined resulted in an accuracy of 73%.

The paper also highlights learning curves indicating performance improvement with more training data, and a comparative analysis showing computational models outperforming human detection in more diverse news domains while humans excelled in celebrity news.

Cross-Domain Analysis and Insights

An interesting aspect of the paper is the cross-domain analysis. The variability in performance when models trained on one domain were tested on another suggests unique linguistic properties tied to different kinds of news. Political, educational, and technological news were found to generalize better, while domains like sports and entertainment did not, suggesting structural differences in deceptive content across domains.
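
The cross-domain protocol amounts to fitting a model on one domain and scoring it on another. A hedged sketch, again with invented stand-in texts rather than the paper's actual domain splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical "politics" training domain (0 = legitimate, 1 = fake).
train_texts = [
    "Senate passes infrastructure bill after lengthy debate",
    "President signs trade agreement with neighboring country",
    "Lawmakers secretly replaced by robots, insider claims",
    "Election results decided by coin flip, officials admit",
]
train_labels = [0, 0, 1, 1]

# Hypothetical "technology" test domain.
test_texts = [
    "Chip maker unveils faster processor at annual conference",
    "Smartphone app reads users' minds, developers boast",
]
test_labels = [0, 1]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)
accuracy = model.score(test_texts, test_labels)  # accuracy on the unseen domain
```

The gap between this cross-domain accuracy and the in-domain cross-validation accuracy is what reveals how domain-specific the deception cues are.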

Human Baseline Performance

The human baseline performance was established through annotation tasks, revealing moderate inter-annotator agreement (measured by Cohen's kappa) and a relatively high accuracy of 70% on FakeNewsAMT and 73% on the Celebrity dataset. The comparison validated the efficacy of the automated models, which surpassed human performance in certain domains.
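
For reference, the kappa statistic used to quantify annotator agreement corrects raw agreement for chance. A self-contained sketch of Cohen's kappa for two annotators:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: product of each rater's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:  # both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values around 0.4-0.6 are conventionally read as "moderate" agreement, consistent with the paper's characterization of its annotators.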

Implications and Future Directions

This research provides meaningful insights into the underlying linguistic differences between fake and real news. The success of readability and psycholinguistic features hints at deeper cognitive patterns distinguishable by computational methods. Future developments could focus on expanding these datasets, improving cross-domain learning, and integrating more nuanced linguistic and stylistic cues to enhance detection models further.

The paper's contributions are significant for both theoretical advancements in understanding deception linguistics and practical applications in the field of automated misinformation detection. As fake news detection continues to be of paramount importance in maintaining the integrity of information in digital communication, studies like these demonstrate the potential of leveraging computational techniques to address complex problems in natural language understanding.

Authors (4)
  1. Bennett Kleinberg (35 papers)
  2. Alexandra Lefevre (1 paper)
  3. Rada Mihalcea (131 papers)
  4. Verónica Pérez-Rosas (15 papers)
Citations (745)