FakeNewsNet Dataset
- FakeNewsNet is a large-scale, multi-modal dataset for studying fake news on Twitter with both content and social context features.
- It aggregates verified labels from PolitiFact and GossipCop, enabling analyses such as detection, early warning, and diffusion modeling.
- Its unified schema and extensive metadata support experiments with models such as SVMs, CNNs, and content–social fusion strategies for fake news detection.
FakeNewsNet is a large-scale, multi-modal data repository explicitly designed for the study of fake news dissemination and detection on social media platforms, with a particular focus on the Twitter ecosystem. The dataset aggregates ground-truth labels, comprehensive news content, diverse social context data, and rich spatiotemporal information for each news item, covering both political and entertainment domains. Its unified schema and extensive feature set facilitate research across fake news detection, early warning, diffusion modeling, and intervention strategies, establishing it as a community benchmark for social media misinformation studies (Shu et al., 2018; Murayama, 2021).
1. Data Sources and Annotation Protocols
FakeNewsNet integrates two primary fact-checking sources:
- PolitiFact Subset: Political news articles verified by PolitiFact fact-checkers. Verdicts are mapped to binary labels: verdicts such as “False” and “Pants on Fire” are treated as fake, while “True” is treated as real.
- GossipCop Subset: Celebrity and entertainment news items with numerical credibility scores. Scores below five are classified as fake; all entertainment news crawled from E! Online is considered real.
All item veracity labels are inherited directly from these organizations, without additional manual annotation or crowdsourced labeling (Shu et al., 2018; Murayama, 2021).
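The labeling protocol above can be expressed as a simple rule-based mapping. The sketch below uses the verdict strings and the score cut-off of 5 from the description; the function names are illustrative, not part of the dataset's API:

```python
# Verdict-to-label mapping per the protocol above (illustrative helpers).
FAKE_VERDICTS = {"False", "Pants on Fire"}
REAL_VERDICTS = {"True"}

def politifact_label(verdict):
    """Binarize a PolitiFact verdict; intermediate verdicts get no label here."""
    if verdict in FAKE_VERDICTS:
        return "fake"
    if verdict in REAL_VERDICTS:
        return "real"
    return None

def gossipcop_label(score):
    """GossipCop credibility scores below 5 are treated as fake."""
    return "fake" if score < 5 else "real"
```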
2. Data Collection and Structure
The repository consists of 23,196 distinct news items—each tagged as real or fake—accompanied by extensive Twitter activity contemporaneous with the publication dates (approximately 2014–2018). Data collection involved daily crawls, archival methods for missing URLs, and hydration of tweet/user records via the Twitter API. The key structural elements are:
- news_content.json: Includes URL, title, publication date (ISO 8601), full text, embedded image URLs, and the news source (“PolitiFact” or “GossipCop”).
- tweets/ and retweets/: One JSON file per tweet or retweet with fields for tweet ID, user ID, timestamp, text, favorite and retweet counts, reply status, and geocoordinates if available.
- user_profiles/: Per-user JSON capturing user ID, screen name, account creation date, location, follower/friend/status counts, and verification status.
- user_timeline_tweets/: Up to 200 recent tweets per user, supporting timeline analysis.
- Network Features: Follower/followee adjacency lists, retweet/reply propagation graphs.
Additional features are available as CSV via the API, including reply sentiment (VADER), per-user bot scores (Botometer), and network aggregates (Shu et al., 2018).
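A minimal sketch of parsing one news_content.json record, assuming the field names listed above (the exact keys may differ by repository version, so treat them as illustrative):

```python
import json

# Illustrative record mirroring the news_content.json schema described above;
# the field names are assumptions based on the listing, not verified keys.
raw = json.dumps({
    "url": "https://www.politifact.com/example-story",
    "title": "Example headline",
    "publish_date": "2017-05-01T00:00:00",
    "text": "Full article body ...",
    "images": ["https://example.com/img1.jpg"],
    "source": "PolitiFact",
})

record = json.loads(raw)
publisher = record["source"]      # "PolitiFact" or "GossipCop"
n_images = len(record["images"])  # size of the visual modality
```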
3. Modalities and Feature Space
FakeNewsNet’s multi-modal schema supports analysis from multiple perspectives:
- Textual Features: Detailed article title and body, URL, publisher, and time stamp.
- Visual Features: Images referenced by the news article.
- Social Context: Full engagement data for tweets sharing, reacting to, and propagating each news item, covering likes, retweets, replies, unique user attributes, and sentiment.
- Spatiotemporal Features: Time-stamped tweet and user locations allow tracing event propagation chains and diffusion patterns.
The repository leverages Twitter’s standard JSON format for compatibility; propagation graphs and follower/followee networks enable graph-based modeling of information spread (Shu et al., 2018; Murayama, 2021).
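As a sketch of the graph-based modeling enabled by the propagation data, a retweet cascade can be traversed from its root tweet. The edge tuples below are invented for illustration; in practice they would be reconstructed from the retweets/ files:

```python
from collections import defaultdict

# Hypothetical (retweet_id, parent_tweet_id) edges from a retweet cascade.
edges = [("t2", "t1"), ("t3", "t1"), ("t4", "t2")]

children = defaultdict(list)
for child, parent in edges:
    children[parent].append(child)

def cascade_size(root):
    """Count all tweets reachable from the root of a retweet cascade."""
    size, stack = 0, [root]
    while stack:
        node = stack.pop()
        size += 1
        stack.extend(children.get(node, []))
    return size

print(cascade_size("t1"))  # 4 tweets in the cascade rooted at t1
```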
4. Statistical Overview and Exploratory Analysis
Aggregate statistics reflect the scale and diversity of the repository. Table 1 summarizes key metrics for both PolitiFact and GossipCop subsets:
| Category | PolitiFact Fake | PolitiFact Real | GossipCop Fake | GossipCop Real |
|---|---|---|---|---|
| News pieces | 432 | 624 | 5,323 | 16,817 |
| Tweeting users | 95,553 | 249,887 | 265,155 | 80,137 |
| Tweets posting news | 164,892 | 399,237 | 519,581 | 876,967 |
| Replies to news-tweets | 11,975 | 41,852 | 39,717 | 11,912 |
| Likes on news-tweets | 31,692 | 93,839 | 96,906 | 41,889 |
| Retweets of news-tweets | 23,489 | 67,035 | 56,552 | 24,955 |
| Avg. followers per user | 1,299.98 | 982.67 | 1,020.99 | 933.64 |
Exploratory analyses reveal differences in content themes (political vs. celebrity) and in account-age distributions (p < 0.05, t-test), as well as in bot prevalence: roughly 22% of users spreading fake news are flagged as likely bots at a 0.5 Botometer threshold, versus roughly 9% for real news. Fake-news cascades exhibit sharp retweet spikes while real news shows more gradual growth, and sentiment analysis reveals higher negative-reply ratios for fake news. Geographic propagation patterns also differ between fake and real stories (Shu et al., 2018).
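The bot-prevalence figures rest on thresholding per-user Botometer scores at 0.5. A minimal sketch, with made-up scores standing in for the dataset's CSV export:

```python
# Illustrative Botometer-style scores keyed by user ID (values are invented);
# the exploratory analysis flags users at or above a 0.5 threshold as bots.
bot_scores = {"u1": 0.82, "u2": 0.11, "u3": 0.64, "u4": 0.05}

BOT_THRESHOLD = 0.5
flagged = [u for u, s in bot_scores.items() if s >= BOT_THRESHOLD]
bot_rate = len(flagged) / len(bot_scores)  # fraction of users flagged as bots
```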
5. Benchmarks and Evaluation Protocols
FakeNewsNet establishes baselines for three core tasks:
- Fake-news detection (content and social features)
- Early detection (temporal patterns)
- Propagation modeling (diffusion and mitigation analysis)
Models evaluated include SVMs, Logistic Regression, Naive Bayes, CNNs, and Social Article Fusion (auto-encoders and LSTM-based sequence models). Standard metrics are accuracy, precision, recall, and F1 score. Table 2 reports performance for selected methods:
| Model | PolitiFact Acc | F1 | GossipCop Acc | F1 |
|---|---|---|---|---|
| SVM (text only) | 0.580 | 0.659 | 0.497 | 0.595 |
| CNN (text only) | 0.629 | 0.583 | 0.723 | 0.725 |
| Social Article Fusion (S) | 0.654 | 0.681 | 0.689 | 0.703 |
| Social Article Fusion (S+A) | 0.691 | 0.706 | 0.689 | 0.717 |
Precision, recall, and F1 are defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP, FP, and FN denote true positives, false positives, and false negatives for the fake class.
A plausible implication is that fusion strategies combining content and social features outperform standalone text or temporal models.
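Under the standard definitions, these metrics can be computed directly from prediction counts; a minimal sketch, treating "fake" as the positive class:

```python
def precision_recall_f1(y_true, y_pred, positive="fake"):
    """Compute precision, recall, and F1 for the positive (fake) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: one true positive, one false positive, one false negative.
p, r, f = precision_recall_f1(
    ["fake", "fake", "real", "real"],
    ["fake", "real", "fake", "real"],
)
```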
6. Usage, Licensing, and Accessibility
FakeNewsNet is distributed via GitHub (https://github.com/KaiDMML/FakeNewsNet), providing CSV listings, Python hydration scripts, and modular feature-extraction tools. Users must supply their own Twitter API credentials for tweet/user hydration due to Twitter Developer Policy constraints; non-commercial research use is recommended. The code is released under an MIT license, while data access is governed by Twitter’s policies (Shu et al., 2018; Murayama, 2021).
A typical integration pipeline involves:
- Cloning the repository and configuring API credentials.
- Selecting the desired subset (PolitiFact or GossipCop) and feature set.
- Hydrating tweet and user objects using provided scripts.
- Loading JSON outputs into analysis pipelines (e.g., pandas, PyTorch, TensorFlow).
- Extracting features such as sentiment, bot scores, and social aggregates.
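The JSON-loading step of the pipeline above can be sketched as follows; the one-file-per-tweet layout mirrors the tweets/ convention described earlier, while the directory and file names here are invented for demonstration:

```python
import json
import pathlib
import tempfile

def load_tweets(tweet_dir):
    """Load hydrated tweet JSON files (one per tweet) into a list of dicts,
    ready to hand to pandas, PyTorch, or other analysis tooling."""
    records = []
    for path in sorted(pathlib.Path(tweet_dir).glob("*.json")):
        with open(path) as f:
            records.append(json.load(f))
    return records

# Demonstrate on a temporary directory standing in for tweets/.
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "123.json").write_text(
        json.dumps({"tweet_id": "123", "user_id": "u1", "text": "example"})
    )
    tweets = load_tweets(d)
```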
No other social media platforms are currently included; planned updates aim to expand fact-checker coverage and improve collection methodologies (Shu et al., 2018; Murayama, 2021).
7. Limitations and Research Challenges
Principal limitations include:
- Temporal Drift: All engagement data reflect the state at collection time; deleted tweets or accounts are unrecoverable, impacting reproducibility.
- Platform Bias: Exclusively Twitter-based; findings may not generalize to Facebook, Weibo, Reddit, or other systems.
- Class Imbalance and Domain Skew: Disparities in the proportion of fake/real and political/entertainment news can induce bias in models unless explicitly controlled.
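One standard mitigation for the class imbalance noted above is inverse-frequency class weighting; the sketch below uses the GossipCop counts from Table 1 (5,323 fake vs. 16,817 real) to derive per-class loss weights:

```python
from collections import Counter

# Label distribution mirroring the GossipCop skew from Table 1.
labels = ["fake"] * 5323 + ["real"] * 16817

counts = Counter(labels)
n = len(labels)
# Inverse-frequency weights: rarer classes receive proportionally more weight.
weights = {cls: n / (len(counts) * c) for cls, c in counts.items()}
```

Passed as per-class loss weights (e.g. to a weighted cross-entropy), these counteract the tendency of a classifier to favor the majority "real" class.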
These constraints foreground key research challenges in dataset construction, propagation modeling, and generalizability of fake news detection strategies (Murayama, 2021).
FakeNewsNet’s comprehensive, multi-modal architecture continues to facilitate robust research in social media misinformation, supporting diverse methodologies and providing a platform for comparative analysis and model benchmarking (Shu et al., 2018; Murayama, 2021).