FakeSV: Multimodal Benchmark for Fake News
- FakeSV is a comprehensive multimodal benchmark dataset designed to detect fake news on Chinese short-video platforms by integrating video, audio, text, and social signals.
- It employs rigorous annotation protocols with high inter-annotator agreement (Cohen’s κ = 0.89) by cross-referencing fact-checking sources to ensure reliable labeling.
- The dataset supports diverse splitting strategies and fusion models, providing actionable insights for advancing research in multimodal misinformation detection.
FakeSV is a large-scale, multimodal benchmark dataset constructed for the study and evaluation of fake news detection on Chinese short-video platforms. Designed to systematically capture the complex interplay between content signals and social context, FakeSV supports fine-grained analysis and algorithmic innovation for the detection of misinformation in rich media environments. The dataset is notable for its breadth—comprising video, audio, text, user comments, and publisher metadata—and for rigorous annotation protocols that enable both content-based and context-aware modeling.
1. Dataset Construction and Scope
FakeSV was introduced to address fundamental gaps in multimodal fake-news detection research, specifically the scarcity of public short-video benchmarks and the need for datasets integrating diverse content and extensive social signals (Qi et al., 2022). Data collection targeted two of the largest Chinese short-video platforms, Douyin and Kuaishou, focusing on content spanning January 2011 to early 2022. Fact-checking websites provided event keywords and debunked-claim seeds; these informed the crawling and annotation of video samples. The resulting corpus consists of:
- Videos: Raw short video clips (≤5 minutes), sampled keyframes, and cover images.
- Audio: Complete original track for each video.
- Textual modalities: Titles, on-screen text via OCR, and ASR speech transcripts.
- Social context: Up to 100 top user comments per video (with like-counts and reply statistics), plus publisher profiles featuring verification status, fan/follower counts, self-introduction, and historical publishing activity.
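For concreteness, the sketch below shows how one such record might be represented in code; the class and field names are hypothetical illustrations of the modalities listed above, not FakeSV's actual release schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Comment:
    text: str
    likes: int
    replies: int

@dataclass
class Publisher:
    verified: bool
    follower_count: int
    self_intro: str

@dataclass
class FakeSVRecord:
    """Hypothetical container mirroring the modalities described above."""
    video_id: str
    title: str
    keyframe_paths: List[str]          # sampled keyframes plus cover image
    audio_path: str                    # complete original audio track
    ocr_text: str                      # on-screen text extracted via OCR
    asr_transcript: str                # speech transcript from ASR
    comments: List[Comment] = field(default_factory=list)  # up to 100 top comments
    publisher: Optional[Publisher] = None
    label: str = "Real"                # one of {"Fake", "Real", "Debunked"}
```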
FakeSV includes three label categories: Fake, Real, and Debunked, with the binary Fake vs. Real task being canonical for most detection experiments.
| Label | Count |
|---|---|
| Fake | 1,827 |
| Real | 1,827 |
| Debunked | 1,884 |
| Total | 5,538 |
2. Annotation Protocol and Quality Control
Label assignment in FakeSV is performed by cross-referencing video content with established fact-checking portals (e.g., Weibo Community Management, Tencent Jiaozhen, China Fact Check) (Qi et al., 2022). For the primary release, nine postgraduate annotators underwent protocol training followed by a two-pass review involving first and second authors. Consensus mechanisms resolved ambiguous cases; “Other” labels were assigned for irreducible uncertainty. Inter-annotator agreement, as quantified by Cohen’s κ, reached 0.89—indicative of “almost perfect” agreement.
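As an illustration of this statistic, Cohen's κ between two annotators can be computed with scikit-learn's cohen_kappa_score; the label vectors below are invented toy data, not FakeSV annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: labels from two annotators over ten videos (invented data).
annotator_a = ["Fake", "Real", "Fake", "Debunked", "Real",
               "Fake", "Real", "Real", "Fake", "Debunked"]
annotator_b = ["Fake", "Real", "Fake", "Debunked", "Real",
               "Fake", "Real", "Fake", "Fake", "Debunked"]

# kappa corrects raw agreement for the agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```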
Labels follow strict criteria:
- Fake: Video and title jointly present a claim previously debunked and not supported by authoritative news sources.
- Real: Verified by independent, reputable news reports.
- Debunked: Videos concerning claims that have already been debunked; these are kept as a separate category rather than reused directly as positive Fake examples.
3. Multimodal Structure and Feature Representation
Each FakeSV instance incorporates multiple modalities:
- Visual: Video keyframes (sampled, cover), resized and normalized for feature extraction (e.g., VGG-19).
- Audio: Original track, resampled and converted into log-mel spectrograms; encoded with models such as Wav2Vec.
- Text: Tokenized and embedded using pre-trained language models (e.g., BERT-base-Chinese), encompassing both OCR outputs and human-written titles/captions.
- Comments: Each comment embedded using BERT, weighted by like-count, and aggregated, e.g., via a like-weighted average $\mathbf{c} = \sum_{i=1}^{N} w_i \mathbf{c}_i$ with $w_i = (l_i + 1) / \sum_{j=1}^{N} (l_j + 1)$, where $l_i$ is the like-count of comment $i$ (a sketch follows this list).
- Publisher Metadata: Quantitative fields (fan counts, etc.) are min-max normalized.
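A minimal sketch of the like-weighted aggregation above, assuming comments have already been embedded; the add-one smoothing is an assumption, and the published SV-FEND model additionally applies co-attention rather than this plain average.

```python
import numpy as np

def aggregate_comments(embeddings: np.ndarray, likes: np.ndarray) -> np.ndarray:
    """Like-weighted average of comment embeddings.

    embeddings: (N, d) array of BERT comment embeddings.
    likes:      (N,) array of like-counts.
    Add-one smoothing keeps zero-like comments from being dropped (assumption).
    """
    weights = (likes + 1) / (likes + 1).sum()
    return weights @ embeddings  # (d,)

# Toy usage: 3 comments with 768-dim embeddings.
emb = np.random.randn(3, 768)
agg = aggregate_comments(emb, np.array([10, 0, 3]))
print(agg.shape)  # (768,)
```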
These modalities are supplied to models either as individual streams or through hierarchical fusion networks that employ cross-modal attention mechanisms (Yan et al., 12 Jan 2025, Qi et al., 2022).
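The cross-modal attention pattern referenced above can be sketched with a standard multi-head attention layer in PyTorch. This is an illustrative pattern only, with placeholder dimensions, not the exact SV-FEND or MTPareto architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text queries attend over visual tokens (illustrative, not the papers' exact design)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, T, dim); visual_tokens: (B, V, dim)
        fused, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return self.norm(text_tokens + fused)  # residual connection + layer norm

# Toy usage: batch of 2, 16 text tokens attending over 8 visual tokens.
layer = CrossModalAttention()
out = layer(torch.randn(2, 16, 256), torch.randn(2, 8, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```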
4. Dataset Splits, Evaluation Protocol, and Statistics
FakeSV supports several data splitting strategies:
- Chronological/Timestamped: 70% train, 15% validation, 15% test, with strict temporal hold-out to mimic real-world deployment (Yan et al., 12 Jan 2025, Bu et al., 23 Jul 2024).
- Event-level K-fold: Five-fold cross-validation at the event description level, ensuring that test sets comprise unseen events (Qi et al., 2022).
Rounded example split counts (for the 70/15/15 chronological split; see the arithmetic sketch after the table):
| Split | #Videos |
|---|---|
| Train | 3,876 |
| Val | 831 |
| Test | 831 |
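The table's counts follow from applying the 15% validation/test ratios to the 5,538 videos and assigning the remainder to training; the sketch below reproduces the arithmetic and the temporal hold-out idea (the stand-in timestamps are illustrative).

```python
total = 5_538
val = test = round(total * 0.15)   # 831 each
train = total - val - test         # 3,876
print(train, val, test)            # -> 3876 831 831

# Chronological hold-out: order items oldest-to-newest, then slice,
# so the test set is strictly later in time than the training data.
timestamps = list(range(total))    # stand-in for real publish times
ordered = sorted(timestamps)
train_ids = ordered[:train]
val_ids   = ordered[train:train + val]
test_ids  = ordered[train + val:]
assert max(train_ids) < min(val_ids) < min(test_ids)
```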
Key observed data statistics (Fake vs. Real sets):
| Statistic | Fake | Real |
|---|---|---|
| Avg. title length | 22 chars | 35 chars |
| % Empty titles | 12% | 5% |
| % with comments | 68% | 75% |
| Avg. #comments (top100) | 58 | 62 |
| Publisher verified | 15% | 75% |
| Median fan count | 1,200 | 12,000 |
| Video length ≤ 5 min | 100% | 100% |
The canonical Fake vs. Real task is class-balanced (1,827 videos per class).
5. Analytical Insights and Modality-Specific Patterns
FakeSV’s layered annotation enables exploratory analyses revealing modality-specific trends (Qi et al., 2022, Bu et al., 23 Jul 2024):
- Textual features: Fake videos have shorter, more colloquial and emotional titles (exclamations such as "OMG," frequent questions). Emotion-lexicon scoring reveals higher "like" and "surprise" scores in fakes.
- Visual features: NIQE scores indicate significantly lower visual quality in fake videos (the difference is statistically significant).
- Audio features: Increased concentration of high-arousal emotion classes in fake speech.
- Publisher and social context: Fake videos disproportionately originate from unverified publishers (15% verified vs. 75% for reals). Fakes evoke a higher rate of “doubtful” comments (18% vs. 4% for real). Publisher profiles for fakes tend toward higher consumption and lower production metrics.
- Temporal/propagation patterns: 39% of fake videos appear after official debunking events; cover-image duplication rates are elevated in fakes.
Creative-process based analysis (Bu et al., 23 Jul 2024) further exposes that fake news videos in FakeSV exhibit:
- Higher variance in audio-emotion logits.
- Lower text–visual semantic alignment, as measured by JS divergence between CLIP-encoded frame and text distributions (see the sketch after this list).
- Less color-rich and spatially refined on-screen text.
- More monotonous temporal text-exposure patterns.
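A sketch of this alignment measure, assuming frame and text similarity logits are softened into probability distributions before comparison; the random vectors stand in for CLIP outputs, and the papers' exact construction of the distributions may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for CLIP similarity logits of frames and text against a shared
# concept vocabulary (random here; real usage would use CLIP embeddings).
rng = np.random.default_rng(0)
frame_logits = rng.normal(size=128)
text_logits = rng.normal(size=128)

p, q = softmax(frame_logits), softmax(text_logits)
# SciPy's jensenshannon returns the JS *distance*; square it for the divergence.
js_div = jensenshannon(p, q) ** 2
print(f"JS divergence: {js_div:.4f}")  # higher => weaker text-visual alignment
```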
6. Benchmark Tasks, Methodologies, and Performance
FakeSV serves as a benchmark for multiple detection methodologies (Qi et al., 2022, Yan et al., 12 Jan 2025, Bu et al., 23 Jul 2024), supporting tasks such as multimodal fake news detection, social-context analysis, and creative-process-aware modeling.
Models ingest all available modalities (video, audio, text, comments, publisher profiles). Standard feature extraction pipelines include BERT for text, VGG-19 for images, and Wav2Vec for audio. Hierarchical and co-attention fusion mechanisms are employed to exploit cross-modal interactions. Losses are computed at multiple fusion levels in architectures such as MTPareto (Yan et al., 12 Jan 2025). Ablation studies and cross-modal selection strategies demonstrate non-trivial accuracy improvements when all modalities are employed.
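The multi-level loss idea can be sketched as classification heads attached at successive fusion depths, with the training objective a weighted sum of per-level losses; the weights and layer shapes below are placeholders, and this mirrors only the general pattern, not MTPareto's Pareto-based weighting.

```python
import torch
import torch.nn as nn

class MultiLevelFusionNet(nn.Module):
    """Illustrative: a classifier head at each fusion level, losses combined."""
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.fuse1 = nn.Linear(dim * 2, dim)   # e.g., text + audio
        self.fuse2 = nn.Linear(dim * 2, dim)   # e.g., (text+audio) + visual
        self.head1 = nn.Linear(dim, num_classes)
        self.head2 = nn.Linear(dim, num_classes)

    def forward(self, text, audio, visual):
        h1 = torch.relu(self.fuse1(torch.cat([text, audio], dim=-1)))
        h2 = torch.relu(self.fuse2(torch.cat([h1, visual], dim=-1)))
        return self.head1(h1), self.head2(h2)

model = MultiLevelFusionNet()
text, audio, visual = (torch.randn(4, 256) for _ in range(3))
logits1, logits2 = model(text, audio, visual)
labels = torch.randint(0, 2, (4,))
ce = nn.CrossEntropyLoss()
# Total loss combines intermediate and final fusion levels (weights are placeholders).
loss = 0.5 * ce(logits1, labels) + 1.0 * ce(logits2, labels)
loss.backward()
```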
Performance on the core binary task (Fake vs. Real), as reported across various methods and splits:
| Model | Acc (%) | F1 (%) |
|---|---|---|
| BERT (text) | 76.8 | 76.8 |
| MyVC (text+img+comments) | 75.1 | 75.0 |
| TT (video+audio) | 75.0 | 75.0 |
| SV-FEND (all modals) | 79.3 | 79.2 |
| MTPareto | 84.50 | 84.15 |
| FakingRecipe | 85.35 | 84.83 |
Reported metrics: Accuracy, Precision, Recall, and F₁-score, as well as macro-F1 for imbalanced or multiclass tasks.
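These metrics can be computed directly with scikit-learn; the prediction vectors below are invented toy data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score

# Toy predictions for the binary Fake (1) vs. Real (0) task (invented data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
macro_f1 = f1_score(y_true, y_pred, average="macro")  # for imbalanced/multiclass tasks
print(f"Acc={acc:.3f} P={prec:.3f} R={rec:.3f} F1={f1:.3f} macro-F1={macro_f1:.3f}")
```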
7. Challenges, Limitations, and Future Directions
FakeSV’s real-world focus, multimodal richness, and fine-grained labels establish it as a challenging benchmark for fake news detection research. However, several open challenges persist:
- Domain adaptation: Short-video specific artifacts, linguistic features (Chinese), and social context differ from text/news-image datasets and from Western media.
- Generalization: Emerging event types and new content forms may induce "concept drift." Chronological evaluation mitigates but does not eliminate this concern.
- Annotation: Although inter-annotator agreement is high, some cases ("Other" or "Debunked") remain under-characterized for certain downstream uses.
- Creative process cues: The importance of cross-modal semantic alignment, editing traces, and emotional inference in fakes suggests new detection paradigms and potential for transfer learning.
- Integration with singing-voice deepfake (Fake Song Detection): While not the same domain, insights from FSD regarding domain-specific model training (Xie et al., 2023) highlight the inadequacy of speech-trained audio deepfake detection (ADD) baselines for genre-adapted tasks, a consideration that may carry into future FakeSV expansions.
A plausible implication is that future benchmarks will further increase modality diversity, introduce crosslingual evaluation, and provide annotation for nuanced video edits and composition processes. The empirical superiority of process-aware detection (as shown by FakingRecipe's ~5% accuracy increase over previous SOTA (Bu et al., 23 Jul 2024)) underscores the ongoing need for both dataset and methodological innovation.