
SV-FEND: Multimodal Short Video Fake News Model

Updated 9 November 2025
  • SV-FEND is a multimodal fake news detection model that combines text, audio, visuals, and social cues to address the heterogeneous nature of short videos.
  • It employs co-attention and self-attention mechanisms to selectively integrate cross-modal correlations and enhance context understanding.
  • Empirical results on the FakeSV dataset demonstrate significant accuracy improvements over prior single- and multi-modal approaches.

SV-FEND (Short Video Fake News Detection model) is a multimodal neural architecture for detecting fake news on short video platforms, specifically addressing the challenges posed by the multimodal, heterogeneous, and context-rich content typical of services such as Douyin and Kuaishou. SV-FEND explicitly models cross-modal correlations and integrates social context, surpassing prior single- and multi-modal approaches by selecting informative features and leveraging cues from user interactions and publisher profiles. The model is introduced alongside FakeSV, the largest Chinese short video fake news dataset to date, in the context of practical detection tasks involving news videos, comments, and social metadata (Qi et al., 2022).

1. Motivation and Challenges

SV-FEND targets two significant challenges in short-video fake news detection:

  1. Modal-Rich but Noisy Content: Short-video news comprises six inherently heterogeneous modalities: textual title and transcript, audio track, static frames, motion clips, user comments, and publisher profile. Prior approaches that simply concatenate modality features risk overlooking cross-modal signals and may overfit to uninformative components. SV-FEND addresses this by employing co-attention modules for explicit pairwise correlation modeling and selective feature retention.
  2. Weak Visual Discriminability: The prevalence of advanced video editing (text-boxes, filters, splicing) in both legitimate and fake posts diminishes the discriminative power of visual content alone. SV-FEND supplements content-based representations with social context, including weighted user comments and publisher metadata, integrating these in a unified self-attention mechanism. This enables skeptical commentary or signals of low publisher authority to compensate where content is ambiguous.

2. Architecture Overview

SV-FEND's architecture comprises four sequential components:

  1. Multimodal Feature Extraction:
    • Textual encoding via BERT for combined title and transcript.
    • Audio encoding via VGGish for mel-spectrogram patches.
    • Static frame features via VGG19.
    • Motion clips encoded by C3D.
    • User comment embeddings via BERT, weighted by up-votes.
    • Publisher profile embedding via BERT on verified/self-introduction text.
  2. Cross-Modal Correlation (Co-Attention):
    • Two-stream co-attention Transformer blocks, first between text and audio, then between the text-enhanced representation and static frames.
    • Encoded outputs are cross-filtered representations, enhancing informative synergy among pairs.
  3. Social-Context Fusion (Self-Attention):
    • Single-vector collapse via average pooling for each content modality post co-attention.
    • Concatenation of all six modality vectors (the three co-attended content vectors plus the motion-clip, comment, and publisher vectors), with dimensions equalized to $d_m = 768$, as input to a standard Transformer encoder layer (self-attention).
    • Production of a unified fused vector via sequence average pooling.
  4. Classification:
    • Output fused vector is processed by a one-hidden-layer MLP and final softmax for binary real/fake probability prediction.
Stage              | Input Modalities                                      | Encoder(s)
Feature Extraction | Title+Transcript, Audio, Static Frames, Motion Clips  | BERT, VGGish, VGG19, C3D
Social Context     | User Comments, Publisher Profile                      | BERT
Correlation/Fusion | All modalities                                        | Co-Attention, Self-Attention Transformers
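
For orientation, the following is a minimal PyTorch-style skeleton of this four-stage pipeline. It is an illustrative sketch, not the authors' released implementation: the projection layers, the use of torch.nn.MultiheadAttention as a stand-in for the co-attention blocks, the classifier hidden width, and the assumption that BERT/VGGish/VGG19/C3D features arrive pre-extracted are all assumptions, and only the text-side co-attention stream is shown.

```python
import torch
import torch.nn as nn

class SVFENDSketch(nn.Module):
    """Illustrative skeleton of the SV-FEND pipeline (not the official code)."""

    def __init__(self, d_model=768, co_heads=4, fusion_heads=2):
        super().__init__()
        # Stage 1 features are assumed pre-extracted; project them to a shared width.
        self.proj_audio = nn.Linear(128, d_model)    # VGGish features
        self.proj_frame = nn.Linear(4096, d_model)   # VGG19 fc7 features
        self.proj_clip = nn.Linear(4096, d_model)    # mean-pooled C3D features
        # Stage 2: co-attention, here approximated with standard cross-attention.
        self.co_text_audio = nn.MultiheadAttention(d_model, co_heads, batch_first=True)
        self.co_text_frame = nn.MultiheadAttention(d_model, co_heads, batch_first=True)
        # Stage 3: self-attention fusion over the six modality vectors.
        self.fusion = nn.TransformerEncoderLayer(d_model, fusion_heads, batch_first=True)
        # Stage 4: one-hidden-layer MLP classifier (hidden width is an assumption).
        self.classifier = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, H_T, H_A, H_I, x_V, x_C, x_U):
        # H_T: (B, l, 768) text, H_A: (B, n, 128) audio, H_I: (B, m, 4096) frames,
        # x_V: (B, 4096) mean-pooled clips, x_C/x_U: (B, 768) comments/publisher.
        H_A, H_I, x_V = self.proj_audio(H_A), self.proj_frame(H_I), self.proj_clip(x_V)
        # Text attends to audio, then the text-enhanced stream attends to frames.
        H_TA, _ = self.co_text_audio(H_T, H_A, H_A)
        H_TAI, _ = self.co_text_frame(H_TA, H_I, H_I)
        # Collapse each content sequence to a single vector (average pooling).
        x_T, x_A, x_I = H_TAI.mean(1), H_A.mean(1), H_I.mean(1)
        # Fuse the six modality vectors with a Transformer encoder layer.
        X = torch.stack([x_T, x_A, x_I, x_V, x_C, x_U], dim=1)   # (B, 6, 768)
        x_m = self.fusion(X).mean(1)                              # (B, 768)
        return self.classifier(x_m)                               # (B, 2) logits
```

The co-attention and fusion stages are expanded in Sections 4 and 5.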

3. Modality Encoding and Input Construction

Each modality is independently pre-encoded to form vector sequences of standard dimension:

  • Text (Title + Transcript): Concatenation of short title (≤33 tokens) and transcript (≤211 tokens), BERT-encoded to $H_T \in \mathbb{R}^{l \times 768}$.
  • Audio: Up to 50 mel-spectrogram patches, with each patch encoded via VGGish to $H_A \in \mathbb{R}^{n \times 128}$.
  • Static Frames: ≤83 uniformly sampled video frames, with VGG19 (fc7) outputting $H_I \in \mathbb{R}^{m \times 4096}$.
  • Motion Clips: For each frame, a corresponding 16-frame clip is encoded via C3D (fc7), yielding $H_V \in \mathbb{R}^{m \times 4096}$, with mean pooling to a single vector $x_V$.
  • Comments: Top $k = 23$ comments, BERT-encoded and weighted by up-votes as $x_C = \sum_{j=1}^{k} \alpha_j c_j$, where $\alpha_j = (\ell_j + 1) / (\sum_{t=1}^{k} \ell_t + k)$ and $\ell_j$ is the up-vote count of comment $j$ (a code sketch of this weighting follows at the end of this section).
  • Publisher Profile: Self-introduction and verified status, BERT-encoded; the [CLS] output forms $x_U \in \mathbb{R}^{768}$.

All vectors are either zero-padded or linearly projected to standardized dimensions when aggregated for self-attention.
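
As a concrete illustration of the up-vote weighting for the comments branch, the following sketch computes $x_C$ from per-comment BERT vectors. The function name and the use of pre-extracted [CLS] embeddings are assumptions for illustration.

```python
import torch

def weighted_comment_vector(comment_embs: torch.Tensor, upvotes: torch.Tensor) -> torch.Tensor:
    """Aggregate top-k comment embeddings with add-one-smoothed up-vote weights.

    comment_embs: (k, 768) BERT vectors c_j for the top-k comments.
    upvotes:      (k,)     up-vote counts l_j for the same comments.
    Implements alpha_j = (l_j + 1) / (sum_t l_t + k) and x_C = sum_j alpha_j * c_j.
    """
    k = upvotes.shape[0]
    alpha = (upvotes.float() + 1.0) / (upvotes.float().sum() + k)  # (k,), sums to 1
    return (alpha.unsqueeze(1) * comment_embs).sum(dim=0)          # (768,)

# Example: three comments with 5, 0, and 2 up-votes -> weights 0.6, 0.1, 0.3.
x_C = weighted_comment_vector(torch.randn(3, 768), torch.tensor([5, 0, 2]))
```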

4. Cross-Modal Correlation and Feature Selection

SV-FEND's co-attention is adapted from ViLBERT and instantiated as single- or multi-head attention.

For text and audio features projected to a common dimension $d$, co-attention computes:

$$Q_T = H_T W^Q, \quad K_A = H_A W^K, \quad V_A = H_A W^V$$

$$\mathrm{Attention}_{T \leftarrow A} = \mathrm{softmax}\!\left(\frac{Q_T K_A^{\top}}{\sqrt{d_k}}\right) V_A$$

$$H_{T \leftarrow A} = \mathrm{LayerNorm}\!\left(H_T + \mathrm{Attention}_{T \leftarrow A}\, W^O\right)$$

with analogous formulation for audio attending to text, and subsequently, text to static-frame co-attention. The key effect is selective amplification of modality pairs exhibiting informative correlations.
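
The sketch below implements one such direction (text attending to audio) following these equations. It is a minimal single-head version for clarity, even though the reported configuration uses $h = 4$ heads; the class name and bias-free projections are assumptions.

```python
import math
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """One direction of co-attention: the query stream attends to the context stream."""

    def __init__(self, d_q: int, d_ctx: int, d_k: int = 128):
        super().__init__()
        self.W_Q = nn.Linear(d_q, d_k, bias=False)
        self.W_K = nn.Linear(d_ctx, d_k, bias=False)
        self.W_V = nn.Linear(d_ctx, d_k, bias=False)
        self.W_O = nn.Linear(d_k, d_q, bias=False)   # project back to the query width
        self.norm = nn.LayerNorm(d_q)

    def forward(self, H_q: torch.Tensor, H_ctx: torch.Tensor) -> torch.Tensor:
        # H_q: (B, l, d_q) query stream (e.g. text); H_ctx: (B, n, d_ctx) context (e.g. audio).
        Q, K, V = self.W_Q(H_q), self.W_K(H_ctx), self.W_V(H_ctx)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (B, l, n)
        attended = torch.softmax(scores, dim=-1) @ V               # (B, l, d_k)
        return self.norm(H_q + self.W_O(attended))                 # residual + LayerNorm

# Text attends to audio: H_T (B, l, 768), H_A (B, n, 128) -> H_{T<-A} (B, l, 768).
H_T, H_A = torch.randn(2, 40, 768), torch.randn(2, 50, 128)
H_TA = CoAttention(d_q=768, d_ctx=128)(H_T, H_A)
```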

After co-attention, feature selection is achieved by average pooling over the sequence dimension:

$$x_T = \frac{1}{l} \sum_{i=1}^{l} H_{T \leftarrow A,I}[i,:], \quad x_A = \frac{1}{n} \sum_{j=1}^{n} H_{A \leftarrow T}[j,:], \quad x_I = \frac{1}{m} \sum_{k=1}^{m} H_{I \leftarrow T}[k,:]$$

yielding compact, cross-modal-filtered representations for downstream fusion.
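
In tensor terms this selection is simply a mean over the sequence axis; the shapes below assume each co-attended output has already been projected to 768 dimensions.

```python
import torch

# Co-attended sequences (batch-first); all widths assumed projected to 768.
H_TAI = torch.randn(2, 40, 768)   # text after attending to audio and frames
H_AT  = torch.randn(2, 50, 768)   # audio after attending to text
H_IT  = torch.randn(2, 83, 768)   # static frames after attending to text

x_T, x_A, x_I = H_TAI.mean(dim=1), H_AT.mean(dim=1), H_IT.mean(dim=1)  # each (2, 768)
```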

5. Social Context Integration and Fusion

The three co-attended content vectors ($x_T$, $x_A$, $x_I$) and the remaining three vectors ($x_V$ for motion clips, $x_C$ for comments, $x_U$ for the publisher profile) are concatenated as

$$X = [x_T; x_A; x_I; x_V; x_C; x_U] \in \mathbb{R}^{6 \times 768}$$

This sequence is input to a one-layer Transformer encoder with self-attention and feed-forward sublayers, producing $Z \in \mathbb{R}^{6 \times 768}$. The final fused feature is obtained via average pooling across the six positions:

$$x_m = \frac{1}{6} \sum_{p=1}^{6} Z[p,:]$$

This architecture allows global interactions among content and social modalities, enabling the model to aggregate both original and contextual cues.
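
A minimal sketch of this fusion step, assuming all six vectors are already 768-dimensional; the feed-forward width is PyTorch's default and is an assumption, while the two-head configuration matches the hyperparameters listed in Section 6.

```python
import torch
import torch.nn as nn

d_m = 768
fusion = nn.TransformerEncoderLayer(d_model=d_m, nhead=2, batch_first=True)

# Content vectors (x_T, x_A, x_I) plus motion-clip, comment, and publisher vectors.
B = 4
x_T, x_A, x_I, x_V, x_C, x_U = (torch.randn(B, d_m) for _ in range(6))

X = torch.stack([x_T, x_A, x_I, x_V, x_C, x_U], dim=1)  # (B, 6, 768)
Z = fusion(X)                                           # self-attention across 6 positions
x_m = Z.mean(dim=1)                                     # (B, 768) fused representation
```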

6. Training Regimen and Objective

SV-FEND employs binary cross-entropy loss over the softmax outputs:

$$p = \mathrm{softmax}(W_b x_m + b_b), \quad L_{\mathrm{cls}} = -\left[(1 - y)\log p_0 + y \log p_1\right]$$

with $y$ as the binary label. The total objective contains only this classification loss; no auxiliary, alignment, or adversarial losses are introduced.
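
A sketch of the classification head and loss, assuming the fused vector $x_m$ from the previous section; in PyTorch the softmax and negative log-likelihood are typically fused into nn.CrossEntropyLoss, which is equivalent to the expression above for two classes. The hidden width of the MLP is an assumption.

```python
import torch
import torch.nn as nn

d_m, hidden = 768, 256              # hidden width is an assumption
head = nn.Sequential(nn.Linear(d_m, hidden), nn.ReLU(), nn.Linear(hidden, 2))
criterion = nn.CrossEntropyLoss()   # softmax + negative log-likelihood in one op

x_m = torch.randn(16, d_m)          # fused vectors for a batch of 16 videos
y = torch.randint(0, 2, (16,))      # 0 = real, 1 = fake

logits = head(x_m)                  # (16, 2)
loss = criterion(logits, y)         # batch-averaged L_cls
```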

Key Hyperparameters:

  • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$)
  • Learning rate: $1 \times 10^{-4}$
  • Batch size: 16 (class-balanced)
  • Training epochs: up to 30, with early stopping on the dev set
  • Co-attention heads: $h = 4$, dimension $d_k = 128$
  • Self-attention heads: $h = 2$
  • Maximum lengths: $m = 83$ frames, $n = 50$ audio patches, $k = 23$ comments

This setup is designed to balance effective learning of cross-modal and contextual signals with robust generalization.
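
For orientation, the setup above maps onto a standard PyTorch training loop such as the sketch below; the model, data loaders, evaluation function, and early-stopping patience are placeholders rather than details reported for SV-FEND.

```python
import torch
import torch.nn as nn

def train_svfend(model: nn.Module, train_loader, dev_loader, evaluate_fn,
                 epochs: int = 30, lr: float = 1e-4, patience: int = 5):
    """Train with Adam and dev-set early stopping (patience value is an assumption)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    criterion = nn.CrossEntropyLoss()
    best_f1, wait = 0.0, 0
    for _ in range(epochs):
        model.train()
        for inputs, labels in train_loader:           # batches of 16, class-balanced
            optimizer.zero_grad()
            loss = criterion(model(*inputs), labels)  # `inputs` = per-modality tensors
            loss.backward()
            optimizer.step()
        dev_f1 = evaluate_fn(model, dev_loader)       # caller-supplied dev-set macro-F1
        if dev_f1 > best_f1:
            best_f1, wait = dev_f1, 0
        else:
            wait += 1
            if wait >= patience:
                break                                 # early stopping
    return model
```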

7. Empirical Performance and Ablations

On the FakeSV benchmark (five-fold event-split cross-validation), SV-FEND achieves:

  • Accuracy: $79.31 \pm 2.75\%$
  • Macro-F1: $79.24 \pm 2.79\%$

SV-FEND surpasses all single-modality baselines (text-only BERT: 76.82%) and previously leading multimodal approaches (best prior: ~75.07%).

Ablation results demonstrate:

  • Removal of all news content: accuracy drops to 74.89% (−4.42 points).
  • Removal of all social context: accuracy drops to 78.62% (−0.69 points).
  • Content ablation: text removal has the greatest effect (75.37%), followed by visual (frames + clips, 77.97%), then audio (78.95%).
  • Social ablation: removing the publisher profile yields 78.76%, removing comments 79.09%.

Temporal splits (train on the earlier 70% of events, test on the later 30%) yield 81.05% accuracy, indicating robustness to emerging fake news events.
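
Both evaluation protocols keep whole events on one side of the split. The sketch below shows how an event-disjoint five-fold split can be built with scikit-learn's GroupKFold; the array names and sizes are illustrative assumptions, not part of the FakeSV release.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical per-video arrays: fused features X, labels y, and an event id per video.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))
y = rng.integers(0, 2, size=100)
event_ids = rng.integers(0, 20, size=100)        # 20 news events

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=event_ids)):
    # Videos from the same event never appear in both train and test,
    # so the model cannot rely on memorized event-specific cues.
    assert set(event_ids[train_idx]).isdisjoint(event_ids[test_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test videos")
```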

Summary Table: SV-FEND vs. Baselines (FakeSV Benchmark)

Model                          | Accuracy (%) | Macro-F1 (%)
SV-FEND                        | 79.31        | 79.24
Best prior multimodal method   | ~75.07       | n/a
Best text-only baseline (BERT) | 76.82        | n/a

The empirical results verify the utility of two-stage attention mechanisms for selecting and integrating informative cues from both multimodal content and social context. Removal of individual components quantifies their unique contribution, with textual content being most critical among content features and publisher profile among social signals.


Overall, SV-FEND advances multimodal fake news detection on short video platforms via hierarchical attention and context-aware social signal integration, establishing a new state of the art on the FakeSV dataset (Qi et al., 2022).

