"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection (1705.00648v1)

Published 1 May 2017 in cs.CL and cs.CY

Abstract: Automatic fake news detection is a challenging problem in deception detection, and it has tremendous real-world political and social impact. However, statistical approaches to combating fake news have been dramatically limited by the lack of labeled benchmark datasets. In this paper, we present LIAR: a new, publicly available dataset for fake news detection. We collected 12.8K manually labeled short statements, spanning a decade and various contexts, from PolitiFact.com, which provides a detailed analysis report and links to source documents for each case. This dataset can be used for fact-checking research as well. Notably, this new dataset is an order of magnitude larger than the previously largest public fake news datasets of a similar type. Empirically, we investigate automatic fake news detection based on surface-level linguistic patterns. We have designed a novel, hybrid convolutional neural network to integrate meta-data with text. We show that this hybrid approach can improve a text-only deep learning model.

Citations (1,267)

Summary

  • The paper introduces a benchmark dataset of 12,836 labeled short statements that fills a crucial gap for fake news detection research.
  • It details a hybrid CNN that integrates meta-data with text, improving test-set accuracy from 0.270 (text-only CNN) to 0.274.
  • The dataset’s extensive annotations from diverse contexts support further NLP research in stance classification, argument mining, and fact-checking.

A New Benchmark Dataset for Fake News Detection

The paper "Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection" by William Yang Wang presents a significant contribution to deception detection in the context of fake news. It introduces the LIAR dataset, a substantial resource designed to address the lack of labeled benchmark datasets for fake news detection.

The LIAR dataset comprises 12,836 manually labeled short statements collected over a decade from PolitiFact.com. This dataset is an order of magnitude larger than previously available datasets in this domain, such as the dataset by Vlachos and Riedel, which contained only 221 statements. The increased size and detailed annotations of the LIAR dataset facilitate the development and benchmarking of machine learning models for fake news detection.

Dataset Composition and Features

The LIAR dataset contains extensive meta-data attributes, including truthfulness, subject, context/venue, speaker, state, party affiliation, and the speaker's prior history of deceptive statements. The labeled statements originate from various contexts, such as political debates, TV/radio interviews, social media posts, and news releases, ensuring a broad and realistic representation of statements. The fine-grained labeling uses six categories: pants-fire, false, barely-true, half-true, mostly-true, and true, with a well-balanced distribution among them, except for the 'pants-fire' category, which is less represented.
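For readers who want to work with the release directly, the dataset is distributed as tab-separated files; the minimal loading sketch below illustrates one way to read it. The column order and field names here are assumptions based on the public TSV release and should be checked against the dataset's README; "train.tsv" is a hypothetical local path.

```python
import csv
from collections import Counter

# Assumed column layout of the public LIAR TSV release (train/valid/test.tsv);
# verify against the dataset README before relying on the exact order.
FIELDS = [
    "id", "label", "statement", "subject", "speaker", "job", "state",
    "party", "barely_true_ct", "false_ct", "half_true_ct",
    "mostly_true_ct", "pants_fire_ct", "venue",
]

def load_liar(path):
    """Yield one dict per labeled statement from a LIAR-style TSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield dict(zip(FIELDS, row))

# Hypothetical local path; prints the six-way label distribution.
rows = list(load_liar("train.tsv"))
print(Counter(r["label"] for r in rows))
```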

Analytical and Empirical Contributions

The paper evaluates several machine learning models using the LIAR dataset, including logistic regression (LR), support vector machines (SVMs), long short-term memory networks (LSTMs), and convolutional neural networks (CNNs). The authors introduce a hybrid CNN architecture designed to integrate meta-data with text, demonstrating that this approach can enhance the performance of text-only models.
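The summary above does not pin down the exact architecture, so the following PyTorch sketch shows only the general pattern described: convolutional features extracted from the statement text, parallel convolutional features from embedded meta-data tokens, and a concatenation feeding a six-way classifier. The class name `HybridCNN` and every layer size, kernel width, and embedding dimension are illustrative assumptions, not the authors' reported hyperparameters.

```python
import torch
import torch.nn as nn

class HybridCNN(nn.Module):
    """Illustrative text + meta-data CNN; all dimensions are assumptions."""
    def __init__(self, vocab_size, meta_vocab_size, emb_dim=100,
                 num_filters=128, kernel_sizes=(3, 4, 5), num_classes=6):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, emb_dim)
        self.meta_emb = nn.Embedding(meta_vocab_size, emb_dim)
        # Parallel 1-D convolutions of several widths over the statement.
        self.text_convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes)
        # A single convolution over the concatenated meta-data tokens.
        self.meta_conv = nn.Conv1d(emb_dim, num_filters, 3)
        self.fc = nn.Linear(num_filters * (len(kernel_sizes) + 1), num_classes)

    def forward(self, text_ids, meta_ids):
        # (batch, seq) -> (batch, emb_dim, seq), as Conv1d expects.
        t = self.text_emb(text_ids).transpose(1, 2)
        m = self.meta_emb(meta_ids).transpose(1, 2)
        # Max-over-time pooling of each feature map, then concatenate.
        t_feats = [conv(t).relu().max(dim=2).values for conv in self.text_convs]
        m_feat = self.meta_conv(m).relu().max(dim=2).values
        return self.fc(torch.cat(t_feats + [m_feat], dim=1))

# Example forward pass with random token ids (sizes are arbitrary).
model = HybridCNN(vocab_size=20000, meta_vocab_size=5000)
logits = model(torch.randint(0, 20000, (32, 40)),   # 32 statements, 40 tokens
               torch.randint(0, 5000, (32, 12)))    # 12 meta-data tokens each
```

Treating the meta-data fields as one token sequence lets a single convolution handle a variable number of attributes, which fits the paper's finding that fields such as speaker affiliation and context carry useful signal.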

Key findings from the experimental evaluation show:

  • CNNs achieved the highest accuracy on the test set (0.270) among text-only models.
  • The hybrid CNN model, which incorporated meta-data, reached an improved accuracy of 0.274.
  • Meta-data attributes such as speaker affiliation, job, and context contributed to enhanced model performance, confirming the value of combining textual and speaker-related features in fake news detection.
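For context, these gains sit well above chance: uniform random guessing over the six labels yields an expected accuracy of 1/6 ≈ 0.167, so both the text-only CNN (0.270) and the hybrid model (0.274) clear that bar comfortably, while the meta-data contribution itself is a modest +0.004 absolute.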

Implications and Future Directions

The LIAR dataset's extensive labeling and meta-data provide a robust foundation for advancing computational approaches to fake news detection. The empirical results underline the importance of integrating diverse features from text and meta-data to improve detection accuracy.

The detailed annotations and grounded truthfulness evaluations make this dataset suitable for various other NLP tasks, such as stance classification, argument mining, topic modeling, and political NLP research. Additionally, the dataset opens avenues for exploring automatic fact-checking using external knowledge bases, which could further enhance the reliability and accuracy of fake news detectors.

In summary, this paper presents a valuable resource for the research community, enabling more sophisticated and accurate approaches to detecting fake news. The findings encourage further research into hybrid models that leverage comprehensive meta-data, and the dataset itself serves as a benchmark for future advancements in this critical area of research.