SV-FEND: Multimodal Short Video Fake News Model
- SV-FEND is a multimodal fake news detection model that combines text, audio, visuals, and social cues to address the heterogeneous nature of short videos.
- It employs co-attention and self-attention mechanisms to selectively integrate cross-modal correlations and enhance context understanding.
- Empirical results on the FakeSV dataset demonstrate significant accuracy improvements over prior single- and multi-modal approaches.
SV-FEND (Short Video Fake news dEtectioN moDel) is a multimodal neural architecture for detecting fake news on short video platforms, specifically addressing the challenges posed by the multimodal, heterogeneous, and context-rich content typical of services such as Douyin and Kuaishou. SV-FEND explicitly models cross-modal correlations and integrates social context, surpassing prior single- and multi-modal approaches by selecting informative features and leveraging cues from user interactions and publisher profiles. The model is introduced alongside FakeSV, the largest Chinese short video fake news dataset to date, in the context of practical detection tasks involving news videos, comments, and social metadata (Qi et al., 2022).
1. Motivation and Challenges
SV-FEND targets two significant challenges in short-video fake news detection:
- Modal-Rich but Noisy Content: Short-video news comprises six inherently heterogeneous modalities (textual title and transcript, audio track, static frames, motion clips, user comments, and publisher profile). Prior approaches that simply concatenate modality features risk overlooking cross-modal signals and may overfit to uninformative components. SV-FEND addresses this by employing co-attention modules for explicit pairwise correlation modeling and selective feature retention.
- Weak Visual Discriminability: The prevalence of advanced video editing (text-boxes, filters, splicing) in both legitimate and fake posts diminishes the discriminative power of visual content alone. SV-FEND supplements content-based representations with social context, including weighted user comments and publisher metadata, integrating these in a unified self-attention mechanism. This enables skeptical commentary or signals of low publisher authority to compensate where content is ambiguous.
2. Architecture Overview
SV-FEND's architecture comprises four sequential components:
- Multimodal Feature Extraction:
- Textual encoding via BERT for combined title and transcript.
- Audio encoding via VGGish for mel-spectrogram patches.
- Static frame features via VGG19.
- Motion clips encoded by C3D.
- User comment embeddings via BERT, weighted by up-votes.
- Publisher profile embedding via BERT on verified/self-introduction text.
- Cross-Modal Correlation (Co-Attention):
- Two-stream co-attention Transformer blocks, first between text and audio, then between the text-enhanced representation and static frames.
- The outputs are cross-filtered representations that emphasize mutually informative features within each modality pair.
- Social-Context Fusion (Self-Attention):
- Each co-attended content sequence is collapsed to a single vector via average pooling.
- The resulting content vectors are concatenated with the clip, comment, and publisher vectors (six vectors in total), equalized to a common dimension, and fed as a sequence to a standard Transformer encoder layer (self-attention).
- Production of a unified fused vector via sequence average pooling.
- Classification:
- Output fused vector is processed by a one-hidden-layer MLP and final softmax for binary real/fake probability prediction.
| Stage | Input Modalities | Encoder(s) |
|---|---|---|
| Feature Extraction | Title+Transcript, Audio, Static Frames, Motion Clips | BERT, VGGish, VGG19, C3D |
| Social Context | User Comments, Publisher Profile | BERT |
| Correlation/Fusion | All modalities | Co-Attention, Self-Attention Transformers |
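The following PyTorch sketch wires these four stages together at a shape level. It is a minimal illustration under assumed settings (a common hidden dimension of 128, four attention heads, `nn.MultiheadAttention` in place of full ViLBERT-style co-attention blocks), and it assumes features have already been extracted offline by the pre-trained encoders; the class and argument names are not those of the released implementation.

```python
# Minimal, illustrative wiring of the four SV-FEND stages (an assumption-laden
# sketch, not the authors' code). Inputs are pre-extracted features:
# BERT tokens (768-d), VGGish patches (128-d), VGG19/C3D fc7 features (4096-d).
import torch
import torch.nn as nn

D = 128  # assumed common hidden dimension

class SVFENDSketch(nn.Module):
    def __init__(self, d_text=768, d_audio=128, d_frame=4096, d_clip=4096):
        super().__init__()
        # Stage 1 happens offline; here we only project each modality to D.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(d_text, D), "audio": nn.Linear(d_audio, D),
            "frames": nn.Linear(d_frame, D), "clip": nn.Linear(d_clip, D),
            "comments": nn.Linear(d_text, D), "publisher": nn.Linear(d_text, D),
        })
        # Stage 2: two-stream co-attention, first text<->audio, then text<->frames.
        self.coatt = nn.ModuleDict({
            k: nn.MultiheadAttention(D, num_heads=4, batch_first=True)
            for k in ("t2a", "a2t", "t2v", "v2t")
        })
        # Stage 3: one self-attention Transformer layer over the six vectors.
        self.fusion = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        # Stage 4: one-hidden-layer MLP; softmax is applied in the loss.
        self.head = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, 2))

    def forward(self, text, audio, frames, clip, comments, publisher):
        t, a, v = self.proj["text"](text), self.proj["audio"](audio), self.proj["frames"](frames)
        t1, _ = self.coatt["t2a"](t, a, a)    # text attends to audio
        a1, _ = self.coatt["a2t"](a, t, t)    # audio attends to text
        t2, _ = self.coatt["t2v"](t1, v, v)   # text-enhanced attends to frames
        v1, _ = self.coatt["v2t"](v, t1, t1)  # frames attend to text-enhanced
        vectors = [t2.mean(1), a1.mean(1), v1.mean(1),   # pooled content vectors
                   self.proj["clip"](clip),               # mean-pooled motion clips
                   self.proj["comments"](comments),       # up-vote-weighted comments
                   self.proj["publisher"](publisher)]     # publisher profile
        fused = self.fusion(torch.stack(vectors, dim=1)).mean(dim=1)
        return self.head(fused)                           # logits for {real, fake}
```

For example, `SVFENDSketch()(torch.randn(2, 244, 768), torch.randn(2, 50, 128), torch.randn(2, 83, 4096), torch.randn(2, 4096), torch.randn(2, 768), torch.randn(2, 768))` returns a (2, 2) logit tensor for a toy batch of two videos.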
3. Modality Encoding and Input Construction
Each modality is independently pre-encoded to form vector sequences of standard dimension:
- Text (Title + Transcript): The short title (≤33 tokens) and transcript (≤211 tokens) are concatenated and BERT-encoded into a sequence of 768-dimensional token features $\mathbf{X}_t$.
- Audio: Up to 50 mel-spectrogram patches, each encoded via VGGish into a 128-dimensional vector, forming $\mathbf{X}_a$.
- Static Frames: ≤83 uniformly sampled video frames, each encoded by the VGG19 fc7 layer into a 4096-dimensional vector, forming $\mathbf{X}_v$.
- Motion Clips: For each sampled frame, a corresponding 16-frame clip is encoded via C3D (fc7) into a 4096-dimensional vector; mean pooling over clips yields a single vector $\mathbf{m}_c$.
- Comments: Top-ranked comments are BERT-encoded to vectors $\mathbf{c}_i$ and aggregated as an up-vote-weighted sum $\mathbf{m}_{com} = \sum_i w_i \mathbf{c}_i$, where $w_i = l_i / \sum_j l_j$ and $l_i$ is the number of up-votes on comment $i$.
- Publisher Profile: The self-introduction and verified status are BERT-encoded; the [CLS] output forms $\mathbf{m}_p$.
All vectors are either zero-padded or linearly projected to standardized dimensions when aggregated for self-attention.
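As a concrete illustration of this input construction, the snippet below sketches the up-vote-weighted comment aggregation and the zero-padding of variable-length sequences. The helper names, the +1 smoothing of the comment weights, and the toy tensors are assumptions for the sketch, not part of the original pipeline.

```python
# Illustrative helpers for the input construction above, assuming features
# have already been produced by the pre-trained encoders (BERT: 768-d,
# VGGish: 128-d, VGG19/C3D fc7: 4096-d).
import torch
import torch.nn.functional as F

def aggregate_comments(comment_embs: torch.Tensor, upvotes: torch.Tensor) -> torch.Tensor:
    """Weighted sum of BERT comment embeddings, weights proportional to up-votes.

    comment_embs: (num_comments, 768), upvotes: (num_comments,)
    """
    w = upvotes.float() + 1.0          # +1 smoothing (an assumption) so zero-vote comments still count
    w = w / w.sum()
    return (w.unsqueeze(-1) * comment_embs).sum(dim=0)    # -> (768,)

def pad_or_truncate(seq: torch.Tensor, max_len: int) -> torch.Tensor:
    """Zero-pad (or cut) a feature sequence to a fixed length, e.g. 50 audio
    patches or 83 frames per video."""
    if seq.size(0) >= max_len:
        return seq[:max_len]
    return F.pad(seq, (0, 0, 0, max_len - seq.size(0)))   # pad along the time axis

# Example: 12 comments with their up-vote counts, 37 VGG19 frame features.
comments = aggregate_comments(torch.randn(12, 768), torch.tensor([5, 0, 2] * 4))
frames = pad_or_truncate(torch.randn(37, 4096), max_len=83)
print(comments.shape, frames.shape)   # torch.Size([768]) torch.Size([83, 4096])
```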
4. Cross-Modal Correlation and Feature Selection
SV-FEND’s co-attention is adapted from ViLBERT, instantiated as single- or multi-head attention:
For text features $\mathbf{X}_t$ and audio features $\mathbf{X}_a$ projected to a common dimension $d$, the text-to-audio co-attention computes

$$\tilde{\mathbf{X}}_t = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_t \mathbf{K}_a^{\top}}{\sqrt{d}}\right)\mathbf{V}_a, \qquad \mathbf{Q}_t = \mathbf{X}_t\mathbf{W}_Q, \;\; \mathbf{K}_a = \mathbf{X}_a\mathbf{W}_K, \;\; \mathbf{V}_a = \mathbf{X}_a\mathbf{W}_V,$$

with the analogous formulation for audio attending to text and, subsequently, for the text-enhanced representation attending to static frames. The key effect is selective amplification of modality pairs exhibiting informative correlations.
After co-attention, feature selection is achieved by average pooling over the sequence dimension,

$$\bar{\mathbf{m}}_t = \frac{1}{L_t}\sum_{i=1}^{L_t} \tilde{\mathbf{X}}_{t,i},$$

and likewise $\bar{\mathbf{m}}_a$ and $\bar{\mathbf{m}}_v$ for audio and frames, yielding compact, cross-modal-filtered representations for downstream fusion.
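A minimal single-head co-attention block matching the equation above can be sketched as follows. The class name, the common dimension of 128, and the toy sequence lengths are assumptions, and multi-head attention with residual and feed-forward sublayers (as in ViLBERT-style blocks) is omitted for brevity.

```python
# A minimal single-head co-attention block in the spirit of the equation above:
# queries come from one modality, keys/values from the other. Names and
# dimensions are illustrative assumptions, not the authors' code.
import math
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, d_q_in: int, d_kv_in: int, d: int = 128):
        super().__init__()
        self.d = d
        self.w_q = nn.Linear(d_q_in, d)   # projects the "querying" modality
        self.w_k = nn.Linear(d_kv_in, d)  # projects the attended modality
        self.w_v = nn.Linear(d_kv_in, d)

    def forward(self, x_query: torch.Tensor, x_context: torch.Tensor) -> torch.Tensor:
        # x_query: (B, Lq, d_q_in), x_context: (B, Lk, d_kv_in)
        q, k, v = self.w_q(x_query), self.w_k(x_context), self.w_v(x_context)
        scores = q @ k.transpose(1, 2) / math.sqrt(self.d)    # (B, Lq, Lk)
        return torch.softmax(scores, dim=-1) @ v               # (B, Lq, d)

# Text attends to audio, then the text-enhanced sequence attends to frames,
# and the filtered sequence is average-pooled into a single vector.
text, audio, frames = torch.randn(2, 244, 768), torch.randn(2, 50, 128), torch.randn(2, 83, 4096)
text_audio = CoAttention(768, 128)(text, audio)            # (2, 244, 128)
text_visual = CoAttention(128, 4096)(text_audio, frames)   # (2, 244, 128)
m_t = text_visual.mean(dim=1)                               # (2, 128)
```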
5. Social Context Integration and Fusion
The co-attention-filtered content vectors ($\bar{\mathbf{m}}_t$, $\bar{\mathbf{m}}_a$, $\bar{\mathbf{m}}_v$) are concatenated with the clip, comment, and publisher vectors ($\mathbf{m}_c$, $\mathbf{m}_{com}$, $\mathbf{m}_p$) into a length-six sequence

$$\mathbf{M} = [\bar{\mathbf{m}}_t;\ \bar{\mathbf{m}}_a;\ \bar{\mathbf{m}}_v;\ \mathbf{m}_c;\ \mathbf{m}_{com};\ \mathbf{m}_p] \in \mathbb{R}^{6 \times d}.$$

This sequence is input to a one-layer Transformer encoder with self-attention and feed-forward sublayers, producing $\mathbf{H} \in \mathbb{R}^{6 \times d}$. The final fused feature is obtained via average pooling across the six positions:

$$\mathbf{f} = \frac{1}{6}\sum_{i=1}^{6} \mathbf{H}_i.$$
This architecture allows global interactions among content and social modalities, enabling the model to aggregate both original and contextual cues.
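The fusion stage and the classification head can be summarized in a few lines of PyTorch; the dimension of 128, the four heads, and the random stand-in vectors are assumptions for illustration.

```python
# Sketch of the social-context fusion stage: the six modality vectors are
# stacked into a length-6 sequence, passed through one Transformer encoder
# layer, average-pooled, and classified. Dimensions are assumptions.
import torch
import torch.nn as nn

d = 128
fusion = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
classifier = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))

# Six per-video vectors: co-attended text/audio/frames, clip, comments, publisher.
vectors = [torch.randn(8, d) for _ in range(6)]       # batch of 8 videos
sequence = torch.stack(vectors, dim=1)                # (8, 6, d)
fused = fusion(sequence).mean(dim=1)                  # (8, d)
probs = torch.softmax(classifier(fused), dim=-1)      # (8, 2) real/fake probabilities
```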
6. Training Regimen and Objective
SV-FEND employs binary cross-entropy loss over the softmax outputs,

$$\mathcal{L} = -\big[\, y \log \hat{p} + (1 - y)\log(1 - \hat{p}) \,\big],$$

with $y \in \{0, 1\}$ as the binary label and $\hat{p}$ the predicted probability that the video is fake. The total objective contains only this classification loss; no auxiliary, alignment, or adversarial losses are introduced.
Key Hyperparameters:
- Optimizer: Adam
- Learning rate, attention-head counts, and attention dimension: as in the reference implementation (Qi et al., 2022)
- Batch size: 16, with class-balanced batches
- Training epochs: up to 30, with early stopping on the development set
- Input caps: ≤83 frames and 50 audio patches per video; a fixed number of top-ranked comments
This setup is designed to balance effective learning of cross-modal and contextual signals with robust generalization.
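A hedged sketch of this regimen is given below. `nn.CrossEntropyLoss` over the two logits is mathematically equivalent to softmax followed by binary cross-entropy; the learning rate, patience, and data handling (plain shuffling rather than class-balanced sampling) are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch of the training regimen above. Dataset items are assumed to be
# (modality_tuple, label); the model is any module mapping those modalities to
# two logits, e.g. the SVFENDSketch defined earlier.
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, dev_set, max_epochs=30, patience=5, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # CrossEntropyLoss over two logits == softmax + binary cross-entropy.
    loss_fn = nn.CrossEntropyLoss()
    best_acc, best_state, bad_epochs = -1.0, None, 0
    for epoch in range(max_epochs):
        model.train()
        for features, labels in DataLoader(train_set, batch_size=16, shuffle=True):
            opt.zero_grad()
            loss_fn(model(*features), labels).backward()
            opt.step()
        # Early stopping on the development set: keep the best checkpoint.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for features, labels in DataLoader(dev_set, batch_size=16):
                correct += (model(*features).argmax(dim=-1) == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, best_state, bad_epochs = acc, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    model.load_state_dict(best_state)
    return model
```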
7. Empirical Performance and Ablations
On the FakeSV benchmark (five-fold event-split cross-validation), SV-FEND achieves:
- Accuracy: 79.31%
- Macro-F1: 79.24%
SV-FEND surpasses all single-modality baselines (best text-only model, BERT: 76.82%) and previously leading multimodal approaches (best prior: ~75.07%).
Ablation results demonstrate:
- Removing all news content drops accuracy to 74.89% (−4.42 points).
- Removing all social context drops accuracy to 78.62% (−0.69 points).
- Content ablation: removing text has the greatest effect, followed by the visual features (frames + clips), then audio.
- Social ablation: removing the publisher profile causes a larger drop than removing comments.
Under a temporal split (training on earlier events and testing on later ones), SV-FEND yields comparable accuracy, indicating robustness to emerging fake news events.
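The event-level splitting protocol (no news event shared between train and test folds) can be illustrated with scikit-learn's `GroupKFold`; the toy arrays and the way event labels are consumed are assumptions for illustration, not the released evaluation code.

```python
# Illustration of event-level five-fold splitting: videos about the same news
# event never appear in both train and test.
import numpy as np
from sklearn.model_selection import GroupKFold

video_ids = np.arange(100)                       # toy stand-ins for video indices
event_ids = np.repeat(np.arange(20), 5)           # 20 events, 5 videos each
labels = np.random.randint(0, 2, size=100)        # toy real/fake labels

splitter = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(splitter.split(video_ids, labels, groups=event_ids)):
    # No event id appears on both sides of the split.
    assert set(event_ids[train_idx]).isdisjoint(event_ids[test_idx])
    print(f"fold {fold}: {len(train_idx)} train videos, {len(test_idx)} test videos")
```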
Summary Table: SV-FEND vs. Baselines (FakeSV Benchmark)
| Model | Accuracy (%) | Macro-F1 (%) |
|---|---|---|
| SV-FEND | 79.31 | 79.24 |
| Best prior (multi-modal) | ~75.07 | – |
| Best text (BERT) | 76.82 | – |
The empirical results verify the utility of two-stage attention mechanisms for selecting and integrating informative cues from both multimodal content and social context. Removal of individual components quantifies their unique contribution, with textual content being most critical among content features and publisher profile among social signals.
Overall, SV-FEND advances multimodal fake news detection on short video platforms via hierarchical attention and context-aware integration of social signals, establishing a new state of the art on the FakeSV dataset (Qi et al., 2022).