VFDNET: Visible Fusion Detection Network
- VFDNET is a Vision Transformer-based framework for deepfake detection, extended to multimodal fusion through adaptive matching and wavelet denoising.
- It employs a pure-transformer model for deepfake discrimination and a hybrid IA-VFDnet extension for handling unregistered visible-infrared inputs.
- The architecture integrates advanced self-attention, dual-branch encoders, and robust data augmentation to achieve state-of-the-art accuracy and generalization.
VFDNET refers to the Visible Fusion Detection Network and its architectural derivatives, notably Vision Transformer–powered models for image classification and hybrid CNN-Transformer fusion frameworks for multimodal detection. These systems are optimized for tasks such as robust deepfake discrimination and high-quality visible-infrared fusion, operating in both geometrically registered and registration-free scenarios. VFDNET’s pure-transformer variant achieves state-of-the-art deepfake detection via global self-attention, while the IA-VFDnet extension integrates adaptive matching and wavelet fusion to process unregistered multimodal inputs for object localization and detection.
1. Architectural Principles of VFDNET
VFDNET, as delineated in the context of deepfake detection, is a pure Vision Transformer architecture operating on RGB face images resized to 224×224 and normalized to [0,1]. The inference pipeline first decomposes each image into non-overlapping $P \times P$ patches (where $P$ is not specified), then vectorizes and projects each patch into a $D$-dimensional embedding space via a learned matrix $E$. A learnable “class” token is prepended, and positional encoding is added to the sequence.
The core processing unit is the Transformer encoder, structured as $L$ identical blocks. Each block applies layer normalization, multi-head self-attention (MSA), residual connections, and a 2-layer feed-forward network (FFN) with GELU activation. The class-token embedding after $L$ blocks ($z_L^0$) is passed through a linear head and sigmoid activation for binary classification, e.g., deepfake versus authentic image.
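The patch-embedding stage of this pipeline can be sketched in NumPy. The patch size, embedding width, and random initializations below are illustrative assumptions, since the source does not specify them:

```python
import numpy as np

def patch_embed(img, P=16, D=64, rng=np.random.default_rng(0)):
    """Split an HxWxC image into non-overlapping PxP patches, project each
    to D dims, prepend a class token, and add positional encoding.

    P and D are illustrative choices; the learned matrices are random here.
    """
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    # Non-overlapping patches -> (N, P*P*C) with N = (H/P) * (W/P)
    patches = (img.reshape(H // P, P, W // P, P, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, P * P * C))
    E = rng.standard_normal((P * P * C, D)) * 0.02  # learned projection (random stand-in)
    tokens = patches @ E                            # (N, D)
    cls = np.zeros((1, D))                          # learnable class token (zeros here)
    seq = np.concatenate([cls, tokens], axis=0)     # prepend class token
    pos = rng.standard_normal(seq.shape) * 0.02     # learned positional encoding (random stand-in)
    return seq + pos

img = np.random.default_rng(1).random((224, 224, 3))
seq = patch_embed(img)
print(seq.shape)  # (197, 64): 14*14 = 196 patches + 1 class token
```

For a 224×224 input with $P=16$, the sequence length is $196 + 1$ tokens, matching the standard ViT layout.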
2. Mathematical Formulation
VFDNET models self-attention as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, $V$ are obtained from the token sequence after linear projection, $h$ is the number of attention heads, and $d_k = D/h$. The positional encoding matrix is learned and added prior to the Transformer blocks.
The loss used is batch-averaged binary cross-entropy:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{B}\sum_{i=1}^{B}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right]$$
No auxiliary regularization (e.g., weight decay, label smoothing) or secondary losses are specified.
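A minimal NumPy rendering of single-head scaled dot-product attention and the batch-averaged BCE loss; shapes and values are illustrative, and the multi-head variant simply splits the embedding dimension into $h$ subspaces:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def bce_loss(p, y, eps=1e-7):
    """Batch-averaged binary cross-entropy over predictions p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))   # 5 tokens, d_k = 8 (self-attention: Q = K = V)
out = attention(x, x, x)
print(out.shape)                  # (5, 8)
loss = bce_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0]))
print(round(loss, 4))             # 0.1054
```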
3. Data Processing, Augmentation, and Training Protocols
The data pipeline for VFDNET includes resizing images to 224×224 and pixel normalization to [0,1]. Experiments use a dataset of 140,000 images split into training (70%), validation (15%), and test (15%) partitions. For VFDNET, data augmentation utilizes AutoAugment and RandAugment (random geometric and color operations), with combined variants for enhanced diversity. CNN baselines are subjected to random rotation, scaling, horizontal flipping, and histogram-based normalization.
Model training employs the Adam optimizer with a batch size of 16. Exact learning rates, momentum coefficients, and epoch counts are not specified, but plots suggest 20–30 epochs. Learning-rate scheduling is not documented.
A simplified training loop is:

```python
for epoch in range(E):
    for x, y in train_loader:
        x_aug = AutoAugment(x)
        p = VFDNET(x_aug)
        loss = BCE(p, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    validate(val_loader)
```
4. Performance Evaluation and Benchmarking
VFDNET demonstrates high detection accuracy on binary deepfake tasks. Table-based comparison from the provided test set (15% split) yields:
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| VFDNET | 99.13 | 99.00 | 99.00 | 99.00 |
| MobileNetV3 | 98.00 | 98.11 | 98.09 | 98.09 |
| DFCNET | 95.76 | 92.00 | 91.00 | 89.00 |
| ResNet50 | 84.28 | 84.00 | 84.00 | 84.00 |
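The reported metrics follow the standard confusion-matrix definitions. The counts below are hypothetical, chosen only to illustrate the computation (they are not from the paper):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

# Hypothetical balanced binary test split (illustrative counts only)
acc, prec, rec, f1 = binary_metrics(tp=990, fp=10, fn=10, tn=990)
print(f"acc={acc:.2%} prec={prec:.2%} rec={rec:.2%} f1={f1:.2%}")
# acc=99.00% prec=99.00% rec=99.00% f1=99.00%
```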
Confusion-matrix analysis confirms balanced error rates for VFDNET. In the literature, VGG16 reports comparable accuracy (99.00%), while transformer and ensemble-learning baselines report lower accuracy on matched datasets.
5. Hybrid Extensions: IA-VFDnet and Registration-Free Fusion
IA-VFDnet represents a hybrid CNN–Transformer evolution of VFDNET, designed for multimodal (visible-infrared) detection under unregistered conditions (Guan et al., 2023). The IA-VFDnet architecture comprises:
- Dual-branch encoder: RepVGG (CNN) for visible-band features; Swin Transformer for infrared features.
- Multi-scale fusion at shallow, middle, and deep levels, mediated by two key modules:
- Adaptive Kuhn-Munkres Matching (AKM): Learns soft matching of sub-features between branches using Chebyshev distance and attention weighting, guided by a measurement loss.
- Wavelet Domain Adaptive Fusion (WDAF): Employs 2D Discrete Wavelet Transform (DWT) to decompose features into sub-bands, applies convolutional denoising, and reassembles fused features via inverse DWT.
Fused feature scales are concatenated and passed to a YOLOX detection head for bounding-box and classification outputs.
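The WDAF idea of decompose, operate per sub-band, and reassemble via the inverse transform can be illustrated with a single-level 2D Haar DWT in NumPy. The learned per-sub-band denoising convolutions of the actual module are replaced here by a simple average of the two modalities' sub-bands, so this is a structural sketch rather than the paper's method:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT -> (LL, LH, HL, HH) sub-bands."""
    a = (x[0::2, :] + x[1::2, :]) / 2   # row-pair averages
    d = (x[0::2, :] - x[1::2, :]) / 2   # row-pair differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2
    LH = (a[:, 0::2] - a[:, 1::2]) / 2
    HL = (d[:, 0::2] + d[:, 1::2]) / 2
    HH = (d[:, 0::2] - d[:, 1::2]) / 2
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Exact inverse of haar_dwt2 (perfect reconstruction)."""
    h, w = LL.shape
    a = np.empty((h, 2 * w)); d = np.empty((h, 2 * w))
    a[:, 0::2] = LL + LH; a[:, 1::2] = LL - LH
    d[:, 0::2] = HL + HH; d[:, 1::2] = HL - HH
    x = np.empty((2 * h, 2 * w))
    x[0::2, :] = a + d; x[1::2, :] = a - d
    return x

rng = np.random.default_rng(0)
vis, ir = rng.random((8, 8)), rng.random((8, 8))
# Fuse by averaging corresponding sub-bands (stand-in for learned denoising)
fused = haar_idwt2(*[(v + i) / 2 for v, i in zip(haar_dwt2(vis), haar_dwt2(ir))])
print(np.allclose(fused, (vis + ir) / 2))  # True: averaging commutes with the linear DWT
```

Because the Haar DWT is linear, sub-band averaging reduces to pixel averaging here; the learned convolutions in WDAF are precisely what makes the real module's per-band treatment non-trivial.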
6. Empirical Results and Comparative Analysis
On the registered M3FD dataset, IA-VFDnet outperforms prior fusion-detection methods across six categories: Person, Car, Bus, Motorcycle, Lamp, and Truck. Sample results for mAP@0.5:
| Method | mAP@0.5 |
|---|---|
| DenseFuse | 0.6753 |
| FusionGAN | 0.3938 |
| IFCNN | 0.7283 |
| DDcGAN | 0.5507 |
| U2Fusion | 0.7596 |
| TarDAL | 0.8226 |
| IA-VFDnet | 0.9075 |
For unregistered detection benchmarks (IA-VSW), IA-VFDnet maintains robust mAP under both small and large parallax: overall 0.86 (small parallax: Smoke 0.9088, Wildfire 0.8182; large parallax: Smoke 0.9091, Wildfire 0.8177).
7. Registration-Free Fusion Methodology
Traditional VFDNet required pre-registration of input modalities. IA-VFDnet innovates via internal feature matching (AKM), eliminating geometric registration. Its wavelet-domain fusion (WDAF) refines matched features and denoises, supporting robust detection of small-scale targets (e.g., distant smoke) and accommodating parallax and sensor misalignment. The multi-scale fusion strategy makes IA-VFDnet resilient to modality mismatches and noise without any external alignment module.
8. Discussion and Application Scope
VFDNET’s transformer-based attention mechanisms enable rapid detection of global facial inconsistencies, demonstrated on deepfake image discrimination. The architecture achieves generalization with comparatively fewer parameters and avoids overfitting (val loss 0.0068), in contrast to deep CNNs such as ResNet50. MobileNetV3 provides a lightweight, strong baseline but does not match VFDNET’s performance, suggesting that transformer global attention outperforms depth-wise separable convolutions for these detection tasks.
IA-VFDnet generalizes VFDNET to the challenging domain of multimodal fusion under non-ideal, registration-free conditions. Its modular adaptive matching and denoising fusion pipeline sets new performance standards on registered and unregistered imaging benchmarks.
In summary, VFDNET encapsulates both Vision Transformer–based classification for deepfake detection and, via its hybrid IA-VFDnet extension, registration-free multimodal fusion for challenging IR-VIS detection tasks. These models represent significant advances in attention-based and hybrid architectures for trustworthy, high-performance visual analysis (Urmi et al., 3 Jan 2026, Guan et al., 2023).