
RT-DeepLoc: Temporal Deepfake Localization

Updated 5 February 2026
  • The paper introduces a reconstruction-based framework that localizes deepfakes temporally in audio-visual content by detecting spikes in reconstruction error.
  • The framework comes in fully supervised and weakly supervised variants, employing cross-modal neural architectures and masked autoencoders for frame-level and video-level analysis.
  • Evaluations on the LAV-DF and AV-Deepfake1M datasets demonstrate that RT-DeepLoc outperforms prior approaches, with enhanced cross-domain robustness and localization precision.

Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc) is a family of frameworks for fine-grained temporal localization of forgeries in multimodal (audio-visual) media via reconstruction-based anomaly detection. The core rationale is that a model trained to reconstruct authentic, temporally coherent representations will exhibit significantly higher reconstruction error on anomalous, i.e., manipulated, video segments. This approach leverages deep cross-modal neural architectures, including Masked Autoencoders (MAEs), and reconstruction-based classification, often in both fully and weakly supervised regimes. The RT-DeepLoc methodology is central to two state-of-the-art systems: a fully supervised variant relying on detailed frame-level ground truth (Koutlis et al., 24 Nov 2025) and a weakly supervised MAE-based framework using only video-level labels (Guo et al., 29 Jan 2026).

1. Problem Definition and Conceptual Foundations

RT-DeepLoc addresses the task of detecting and temporally localizing manipulated segments ("deepfakes") in digital video, often with aligned audio. The input is a pair $(V, A)$ denoting video frames and corresponding audio, sampled to $t$ temporally aligned units. The output is a sequence of binary frame labels $(p^1, \ldots, p^t) \in \{0,1\}^t$, where $p^\tau = 1$ indicates manipulation at timestep $\tau$, together with segment boundaries $[s^\tau, e^\tau]$ for manipulated regions.

Two variants define the primary approaches:

  • Supervised RT-DeepLoc learns a direct mapping from $(V, A)$ to per-frame manipulation labels and boundaries, using densely labeled training data (Koutlis et al., 24 Nov 2025).
  • Weakly Supervised RT-DeepLoc uses only video-level (real/fake) labels. It identifies manipulated intervals based on discrepancies in reconstruction error induced by a Masked Autoencoder trained exclusively on real video, coupled with contrastive losses promoting cluster compactness among genuine samples (Guo et al., 29 Jan 2026).

Both methods rest on the hypothesis that authentic audio-visual content exhibits strong mutual dependencies and temporal coherence, whereas deepfakes, by virtue of synthetic artifacts or cross-modal inconsistencies, cause systematic deviations in model reconstructions, observable as frame-localized residual spikes.

2. Model Architecture and Reconstruction Modules

Supervised Approach (Koutlis et al., 24 Nov 2025):

  • Backbone Feature Extraction: Utilizes a frozen pre-trained audio-visual speech model (e.g., AV-HuBERT) with separate convolutional encoders for audio and video, yielding synchronized latent features $X_v, X_a \in \mathbb{R}^{t \times d}$.
  • Reconstruction Module ($\mathfrak{R}$): Implements a sequence of 1D convolutional blocks with down-sampling and up-sampling, mapping inputs ($X_a$ or $X_v$) to outputs of shape $(t, d)$. Three reconstruction tasks are conducted: audio-to-visual ($a \to v$), visual-to-visual ($v \to v$), and audio-to-audio ($a \to a$).
  • Discrepancy Encoding: Constructs frame-wise residuals for each task, then concatenates them to form a per-frame discrepancy feature $Z \in \mathbb{R}^{t \times 3d}$.
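The discrepancy encoding above can be sketched in a few lines. The following NumPy version is illustrative only (the function name, argument names, and toy shapes are assumptions, not the authors' code): it forms the residuals of the three reconstruction tasks and concatenates them into a $(t, 3d)$ feature.

```python
import numpy as np

def discrepancy_features(X_v, X_a, R_av, R_vv, R_aa):
    """Per-frame discrepancy encoding (minimal sketch).

    X_v, X_a          : (t, d) latent video / audio features.
    R_av, R_vv, R_aa  : (t, d) outputs of the a->v, v->v, a->a
                        reconstruction tasks.
    Returns Z of shape (t, 3d): concatenated frame-wise residuals.
    """
    res_av = X_v - R_av   # audio-to-visual residual
    res_vv = X_v - R_vv   # visual-to-visual residual
    res_aa = X_a - R_aa   # audio-to-audio residual
    return np.concatenate([res_av, res_vv, res_aa], axis=-1)

# Toy example with t=2 frames, d=2 channels.
X_v = np.ones((2, 2))
X_a = np.full((2, 2), 2.0)
Z = discrepancy_features(X_v, X_a, np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2)))
print(Z.shape)  # (2, 6)
```

On manipulated frames the residual rows of `Z` would carry larger magnitudes, which downstream layers convert into frame scores.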

Weakly Supervised Approach (Guo et al., 29 Jan 2026):

  • Feature Backbone: Extracts pre-aligned visual and audio features from off-the-shelf models (e.g., TSN for video, Wav2Vec for audio), each in $\mathbb{R}^{T \times C}$.
  • Masked Autoencoder (MAE):

    • Independently masks random temporal segments in each modality (masking ratio $p = 0.75$), encodes only visible tokens via a 12-layer Transformer, and decodes the full sequence through a 4-layer Transformer.
    • Applies a "genuine-focused" reconstruction loss on real videos,

    $$L_{\text{recon}} = I(y{=}0) \sum_{j \in M} \|F_j - F^{rec}_j\|_2^2,$$

    where $M$ denotes the masked indices.
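The genuine-focused loss is simple to state in code. Below is a hedged NumPy sketch (function and variable names are assumptions): the squared error is accumulated only over masked indices, and only when the video-level label marks the clip as real.

```python
import numpy as np

def genuine_recon_loss(F, F_rec, mask, y):
    """Genuine-focused masked reconstruction loss (illustrative sketch).

    F, F_rec : (T, C) original and reconstructed feature sequences.
    mask     : boolean (T,) array, True at masked indices j in M.
    y        : video-level label; 0 = real, 1 = fake.
    The indicator I(y=0) restricts the loss to real videos, so the MAE
    only learns to reconstruct authentic content.
    """
    if y != 0:
        return 0.0
    diff = F[mask] - F_rec[mask]
    return float(np.sum(diff ** 2))

T, C = 6, 3
F = np.ones((T, C))
F_rec = np.zeros((T, C))
mask = np.array([True, False, True, False, False, False])
print(genuine_recon_loss(F, F_rec, mask, y=0))  # 6.0 (2 masked frames x 3 channels)
print(genuine_recon_loss(F, F_rec, mask, y=1))  # 0.0 (loss gated off for fakes)
```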

3. Supervised and Weakly Supervised Training Objectives

Fully Supervised (Koutlis et al., 24 Nov 2025):

  • Reconstruction Loss: Calculated only on authentic ($p^\tau = 0$) frames for all three reconstruction tasks. The total loss is

$$\mathcal{L}_{\mathrm{rec}} = \mathcal{L}_{\mathrm{rec}^{a\to v}} + \mathcal{L}_{\mathrm{rec}^{v\to v}} + \mathcal{L}_{\mathrm{rec}^{a\to a}}$$

  • Detection-Localization Loss: Combines focal loss for per-frame detection, DIoU for temporal segment boundary regression, and total reconstruction loss,

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{loc}} + \beta \mathcal{L}_{\mathrm{rec}}$$

(Typical setting: $\alpha = \beta = 1$.)

Weakly Supervised (Guo et al., 29 Jan 2026):

  • Reconstruction Loss (as above)
  • Multi-head Video Classification ($L_{CLS}$): Supervision via seven classification heads acting on original, reconstructed, and fused representations, with frame score extraction and Top-K pooling.
  • KL Consistency ($L_{KL}$): Enforces agreement between classification outputs of original and reconstructed branches using symmetric KL-divergence.
  • Asymmetric Intra-video Contrastive Loss (AICL):

    • Computes per-frame reconstruction errors, selects Top-K error hotspots, and forms local representative features.
    • For each anchor real video, mines hardest positives (other reals) and hardest negatives (fakes) for triplet loss:

    $$L_{AICL} = \frac{1}{|R|} \sum_{i \in R} \max \left\{ \|f_i - f^+_i\|^2_2 - \|f_i - f^-_i\|^2_2 + m,\; 0 \right\}$$

    (Default margin $m = 0.3$.)

  • Total Loss:

$$L_{\text{total}} = \lambda_1 L_{CLS} + \lambda_2 L_{recon} + \lambda_3 L_{KL} + \lambda_4 L_{AICL}$$

($\lambda_1 = 1.0$, $\lambda_{2,3,4} = 0.1$.)
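The AICL term can be sketched directly from its formula. The NumPy version below is a hedged illustration, not the paper's implementation: hotspot feature selection is assumed to have already happened, and "hardest positive" is taken as the farthest other real anchor while "hardest negative" is the closest fake, matching the triplet hinge above.

```python
import numpy as np

def aicl_loss(real_feats, fake_feats, margin=0.3):
    """Asymmetric intra-video contrastive (triplet) loss, sketch only.

    real_feats : (N_r, d) Top-K hotspot features of real videos (anchors, set R).
    fake_feats : (N_f, d) hotspot features of fake videos.
    For each real anchor f_i, mine the hardest positive f_i^+ (farthest
    other real) and hardest negative f_i^- (closest fake), then apply
    max{ ||f_i - f_i^+||^2 - ||f_i - f_i^-||^2 + m, 0 }, averaged over R.
    """
    total = 0.0
    for i, f in enumerate(real_feats):
        others = np.delete(real_feats, i, axis=0)
        d_pos = np.sum((others - f) ** 2, axis=1).max()      # hardest positive
        d_neg = np.sum((fake_feats - f) ** 2, axis=1).min()  # hardest negative
        total += max(d_pos - d_neg + margin, 0.0)
    return total / len(real_feats)

real = np.array([[0.0, 0.0], [1.0, 0.0]])
fake = np.array([[1.1, 0.0]])
print(round(aicl_loss(real, fake), 4))  # 0.69
```

The asymmetry lies in anchoring only on real videos, which pulls genuine hotspot features into a compact cluster while pushing fakes away.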

4. Inference and Temporal Localization Pipeline

Frame-Level Scoring:

  • In both variants, reconstruction discrepancies are converted into normalized frame scores $s_t$ that indicate forgery probability.

Temporal Proposal Generation:

  • Thresholding at $s_t > \tau$ yields candidate sequences of manipulated frames.
  • Consecutive manipulated frames form proposals. Minor gaps are merged if the break is less than a predefined threshold ($< \Delta$ frames).

Ranking and Non-Maximum Suppression:

  • Proposals are assigned scores by mean frame confidence.
  • Final localization outputs are produced after non-maximum suppression, typically with an IoU threshold (e.g., 0.5) and a cap on the number of proposals (Top-M).

Video-level Decision:

  • A max-confidence or logical-OR aggregation over frame predictions provides a global real/fake label.
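The full pipeline above (thresholding, gap merging, proposal scoring, NMS, video-level decision) can be sketched end to end. The following is a minimal NumPy illustration under assumed default thresholds; the function names and parameter values are hypothetical, not taken from either paper.

```python
import numpy as np

def t_iou(a, b):
    """Temporal IoU of two inclusive frame segments (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def localize(scores, thresh=0.5, max_gap=2, iou_nms=0.5, top_m=10):
    """Inference pipeline sketch; thresholds are illustrative defaults.

    scores : (T,) per-frame forgery scores s_t.
    Returns (proposals, video_label); each proposal is
    (start, end, confidence) with inclusive frame indices.
    """
    flags = scores > thresh
    # 1-2. Group consecutive flagged frames, merging gaps shorter than max_gap.
    segments, start, gap = [], None, 0
    for t, f in enumerate(flags):
        if f:
            if start is None:
                start = t
            end, gap = t, 0
        elif start is not None:
            gap += 1
            if gap >= max_gap:
                segments.append((start, end))
                start = None
    if start is not None:
        segments.append((start, end))
    # 3. Score proposals by mean frame confidence; 4. greedy NMS, cap at top_m.
    scored = sorted(((s, e, float(scores[s:e + 1].mean())) for s, e in segments),
                    key=lambda p: p[2], reverse=True)
    keep = []
    for s, e, c in scored:
        if all(t_iou((s, e), (ks, ke)) < iou_nms for ks, ke, _ in keep):
            keep.append((s, e, c))
        if len(keep) == top_m:
            break
    # 5. Video-level decision: logical OR over frame predictions.
    return keep, int(flags.any())

scores = np.array([0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7])
props, label = localize(scores)
print([(s, e) for s, e, _ in props], label)  # [(1, 4), (8, 8)] 1
```

Note how the single low-score frame at index 3 is absorbed into the first proposal by gap merging, while the longer break before frame 8 splits the proposals.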

5. Datasets, Implementation, and Quantitative Evaluation

Datasets:

  • LAV-DF: Large-scale, audio-visual deepfake dataset (78,703 train/31,501 val/26,100 test for supervised; 36,431 real/99,873 fake for weak supervision).
  • AV-Deepfake1M: 746,180 train/343,240 test (supervised); 420K real/720K fake (weak supervision).

Metrics:

  • Temporal Localization: AP@{0.5, 0.75, 0.95} (supervised); mAP averaged over IoU thresholds {0.1, ..., 0.7} (weakly supervised); AR at fixed proposal budgets.
  • Detection: AUC and AP.
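For readers unfamiliar with the localization metrics, the sketch below shows how AP at one temporal-IoU threshold can be computed; mAP averages this quantity over the listed thresholds. This is a simplified illustration (greedy matching, no interpolation), not the official evaluation code, and all names are assumptions.

```python
def seg_iou(a, b):
    """Temporal IoU of two real-valued segments (start, end), end exclusive."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def ap_at_tiou(preds, gts, tiou=0.5):
    """Average precision at one temporal-IoU threshold (simplified sketch).

    preds : list of (start, end, score) proposals.
    gts   : list of (start, end) ground-truth manipulated segments.
    Proposals are ranked by score; each may match one unmatched ground
    truth with IoU >= tiou (true positive), else it counts as a false positive.
    """
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    matched = [False] * len(gts)
    hits = []
    for s, e, _ in preds:
        best, best_iou = -1, tiou
        for j, g in enumerate(gts):
            if not matched[j] and seg_iou((s, e), g) >= best_iou:
                best, best_iou = j, seg_iou((s, e), g)
        if best >= 0:
            matched[best] = True
            hits.append(1)
        else:
            hits.append(0)
    # Sum precision at each true-positive rank, normalized by #ground truths.
    ap, tp = 0.0, 0
    for k, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            ap += tp / k
    return ap / len(gts) if gts else 0.0

print(ap_at_tiou([(0, 10, 0.9), (20, 30, 0.8), (50, 60, 0.7)],
                 [(0, 10), (21, 31)]))  # 1.0
```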

Experimental Configuration:

  • Training on NVIDIA RTX 4090, batch size 32–64.
  • Adam optimizer; learning rates 1e-3 (supervised), 1e-5 (MAE variant).
  • Sequence length $t = 512$ frames, zero-padded.
  • Masking ratio $p = 0.75$ (weak supervision).

Results Table:

| Method | Supervision | LAV-DF mAP / AR (%) | AV-Deepfake1M mAP / AR (%) | Cross-domain mAP / AR (%) |
|---|---|---|---|---|
| RT-DeepLoc (MAE) | Weak | 84.18 / 84.03 | 32.89 / 48.40 | 16.66 / 81.19 |
| ActionFormer (2022) | Full | 96.76 / 98.01 | 66.87 / 81.82 | n/a |
| LOCO (2025) | Weak | 44.28 / 52.28 | 0.24 / 8.25 | 0.05 / 5.29 |

RT-DeepLoc substantially outperforms prior weakly supervised methods and maintains competitive cross-domain generalization. In the fully supervised setting, RT-DeepLoc sets state-of-the-art AP and AUC on both LAV-DF and AV-Deepfake1M, and remains robust even on in-the-wild manipulations.

6. Ablation Studies and Qualitative Analysis

Ablation studies indicate:

  • Removal of reconstruction signal ("w/o FDN") causes substantial mAP drop (to 62.26%).
  • Excluding AICL or multi-task loss reduces both mAP and AR, confirming the necessity of each component.

Qualitative trends reveal strong alignment of reconstruction-error peaks with ground-truth manipulated intervals. False positives are frequently associated with highly compressed videos, segments containing multiple speakers, or rare languages. Boundary spikes are observed but are less pronounced than primary forgery-induced errors.

7. Limitations, Extensions, and Research Directions

Both frameworks are robust to noise and compression due to reliance on frozen backbone features. Cross-modal reconstruction amplifies deepfake artifacts more effectively than feature concatenation or attention-based strategies. Potential directions for extension include incorporation of adversarial losses on reconstructions, multi-scale temporal modeling, utilization of fine-grained speech transcript alignment, and full end-to-end fine-tuning.

A plausible implication is that modeling bounded priors of authentic content (via reconstruction) enables stronger, more generalizable deepfake localization than direct learning of unbounded forgery patterns—a hypothesis supported by cross-domain results (Guo et al., 29 Jan 2026). Further research may address improved handling of ambiguous or anomalous real content and tighter integration with semantic supervision signals.
