VFDNET: Visible Fusion Detection Network
- VFDNET is a Vision Transformer-based framework for deepfake detection, extended to multimodal fusion through adaptive matching and wavelet denoising.
- It employs a pure-transformer model for deepfake discrimination and a hybrid IA-VFDnet extension for handling unregistered visible-infrared inputs.
- The architecture integrates advanced self-attention, dual-branch encoders, and robust data augmentation to achieve state-of-the-art accuracy and generalization.
VFDNET refers to the Visible Fusion Detection Network and its architectural derivatives, notably Vision Transformer–powered models for image classification and hybrid CNN-Transformer fusion frameworks for multimodal detection. These systems are optimized for tasks such as robust deepfake discrimination and high-quality visible-infrared fusion, operating in both geometrically registered and registration-free scenarios. VFDNET’s pure-transformer variant achieves state-of-the-art deepfake detection via global self-attention, while the IA-VFDnet extension integrates adaptive matching and wavelet fusion to process unregistered multimodal inputs for object localization and detection.
1. Architectural Principles of VFDNET
VFDNET, as delineated in the context of deepfake detection, is a pure Vision Transformer architecture operating on RGB face images resized to 224×224 and normalized to [0,1]. The inference pipeline first decomposes each image into non-overlapping $P \times P$ patches (where $P$ is not specified), then vectorizes and projects each patch into a $D$-dimensional embedding space via a learned matrix $E$. A learnable “class” token is prepended, and positional encoding is added to the sequence.
The core processing unit is the Transformer encoder, structured as $L$ identical blocks. Each block applies layer normalization, multi-head self-attention (MSA), residual connections, and a 2-layer feed-forward network (FFN) with GELU activation. The class-token embedding after $L$ blocks ($z_L^0$) is passed through a linear head and sigmoid activation for binary classification, e.g., deepfake versus authentic image.
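The patch-embedding stage of this pipeline can be sketched in NumPy. The patch size, embedding width, and random initializations below are illustrative assumptions, since the source does not specify them:

```python
import numpy as np

def patch_embed(img, P=16, D=64, rng=np.random.default_rng(0)):
    """Split an HxWxC image into non-overlapping PxP patches, project each
    to D dims, prepend a class token, and add positional encoding.

    P and D are illustrative choices; the learned matrices are random here.
    """
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    # Non-overlapping patches -> (N, P*P*C) with N = (H/P) * (W/P)
    patches = (img.reshape(H // P, P, W // P, P, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, P * P * C))
    E = rng.standard_normal((P * P * C, D)) * 0.02  # learned projection (random stand-in)
    tokens = patches @ E                            # (N, D)
    cls = np.zeros((1, D))                          # learnable class token (zeros here)
    seq = np.concatenate([cls, tokens], axis=0)     # prepend class token
    pos = rng.standard_normal(seq.shape) * 0.02     # learned positional encoding (random stand-in)
    return seq + pos

img = np.random.default_rng(1).random((224, 224, 3))
seq = patch_embed(img)
print(seq.shape)  # (197, 64): 14*14 = 196 patches + 1 class token
```

For a 224×224 input with $P=16$, the sequence length is $196 + 1$ tokens, matching the standard ViT layout.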
2. Mathematical Formulation
VFDNET models self-attention as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, $V$ are obtained from the token sequence after linear projection, $h$ is the number of attention heads, and $d_k = D/h$. The positional encoding matrix is learned and added prior to the Transformer blocks.
The loss used is batch-averaged binary cross-entropy:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{B}\sum_{i=1}^{B}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right]$$
No auxiliary regularization (e.g., weight decay, label smoothing) or secondary losses are specified.
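A minimal NumPy rendering of single-head scaled dot-product attention and the batch-averaged BCE loss; shapes and values are illustrative, and the multi-head variant simply splits the embedding dimension into $h$ subspaces:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def bce_loss(p, y, eps=1e-7):
    """Batch-averaged binary cross-entropy over predictions p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))   # 5 tokens, d_k = 8 (self-attention: Q = K = V)
out = attention(x, x, x)
print(out.shape)                  # (5, 8)
loss = bce_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0]))
print(round(loss, 4))             # 0.1054
```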
3. Data Processing, Augmentation, and Training Protocols
The data pipeline for VFDNET includes resizing images to 224×224 and pixel normalization to [0,1]. Experiments use a dataset of 140,000 images split into training (70%), validation (15%), and test (15%) partitions. For VFDNET, data augmentation utilizes AutoAugment and RandAugment (random geometric and color operations), with combined variants for enhanced diversity. CNN baselines are subjected to random rotation, scaling, horizontal flipping, and histogram-based normalization.
Model training employs the Adam optimizer with a batch size of 16. Exact learning rates, momentum coefficients, and epoch counts are not specified, but plots suggest 20–30 epochs. Learning-rate scheduling is not documented.
A simplified training loop is:

```python
for epoch in range(E):
    for x, y in train_loader:
        x_aug = AutoAugment(x)
        p = VFDNET(x_aug)
        loss = BCE(p, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    validate(val_loader)
```
4. Performance Evaluation and Benchmarking
VFDNET demonstrates high detection accuracy on binary deepfake tasks. Table-based comparison from the provided test set (15% split) yields:
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| VFDNET | 99.13 | 99.00 | 99.00 | 99.00 |
| MobileNetV3 | 98.00 | 98.11 | 98.09 | 98.09 |
| DFCNET | 95.76 | 92.00 | 91.00 | 89.00 |
| ResNet50 | 84.28 | 84.00 | 84.00 | 84.00 |
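The reported metrics follow the standard confusion-matrix definitions. The counts below are hypothetical, chosen only to illustrate the computation (they are not from the paper):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

# Hypothetical balanced binary test split (illustrative counts only)
acc, prec, rec, f1 = binary_metrics(tp=990, fp=10, fn=10, tn=990)
print(f"acc={acc:.2%} prec={prec:.2%} rec={rec:.2%} f1={f1:.2%}")
# acc=99.00% prec=99.00% rec=99.00% f1=99.00%
```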
Confusion-matrix analysis confirms balanced error rates for VFDNET. In the literature, VGG16 reports comparable accuracy (99.00%), while transformer and ensemble-learning baselines report lower accuracy on matched datasets.
5. Hybrid Extensions: IA-VFDnet and Registration-Free Fusion
IA-VFDnet represents a hybrid CNN–Transformer evolution of VFDNET, designed for multimodal (visible-infrared) detection under unregistered conditions (Guan et al., 2023). The IA-VFDnet architecture comprises:
- Dual-branch encoder: RepVGG (CNN) for visible-band features; Swin Transformer for infrared features.
- Multi-scale fusion at shallow, middle, and deep levels, mediated by two key modules:
- Adaptive Kuhn-Munkres Matching (AKM): Learns soft matching of sub-features between branches using Chebyshev distance and attention weighting, guided by a measurement loss.
- Wavelet Domain Adaptive Fusion (WDAF): Employs 2D Discrete Wavelet Transform (DWT) to decompose features into sub-bands, applies convolutional denoising, and reassembles fused features via inverse DWT.
Fused feature scales are concatenated and passed to a YOLOX detection head for bounding-box and classification outputs.
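The WDAF idea of decompose, operate per sub-band, and reassemble via the inverse transform can be illustrated with a single-level 2D Haar DWT in NumPy. The learned per-sub-band denoising convolutions of the actual module are replaced here by a simple average of the two modalities' sub-bands, so this is a structural sketch rather than the paper's method:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT -> (LL, LH, HL, HH) sub-bands."""
    a = (x[0::2, :] + x[1::2, :]) / 2   # row-pair averages
    d = (x[0::2, :] - x[1::2, :]) / 2   # row-pair differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2
    LH = (a[:, 0::2] - a[:, 1::2]) / 2
    HL = (d[:, 0::2] + d[:, 1::2]) / 2
    HH = (d[:, 0::2] - d[:, 1::2]) / 2
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Exact inverse of haar_dwt2 (perfect reconstruction)."""
    h, w = LL.shape
    a = np.empty((h, 2 * w)); d = np.empty((h, 2 * w))
    a[:, 0::2] = LL + LH; a[:, 1::2] = LL - LH
    d[:, 0::2] = HL + HH; d[:, 1::2] = HL - HH
    x = np.empty((2 * h, 2 * w))
    x[0::2, :] = a + d; x[1::2, :] = a - d
    return x

rng = np.random.default_rng(0)
vis, ir = rng.random((8, 8)), rng.random((8, 8))
# Fuse by averaging corresponding sub-bands (stand-in for learned denoising)
fused = haar_idwt2(*[(v + i) / 2 for v, i in zip(haar_dwt2(vis), haar_dwt2(ir))])
print(np.allclose(fused, (vis + ir) / 2))  # True: averaging commutes with the linear DWT
```

Because the Haar DWT is linear, sub-band averaging reduces to pixel averaging here; the learned convolutions in WDAF are precisely what makes the real module's per-band treatment non-trivial.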
6. Empirical Results and Comparative Analysis
On the registered M3FD dataset, IA-VFDnet outperforms prior fusion-detection methods across six categories: Person, Car, Bus, Motorcycle, Lamp, and Truck. Sample results for mAP@0.5:
| Method | mAP@0.5 |
|---|---|
| DenseFuse | 0.6753 |
| FusionGAN | 0.3938 |
| IFCNN | 0.7283 |
| DDcGAN | 0.5507 |
| U2Fusion | 0.7596 |
| TarDAL | 0.8226 |
| IA-VFDnet | 0.9075 |
For unregistered detection benchmarks (IA-VSW), IA-VFDnet maintains robust mAP under both small and large parallax: overall 0.86 (small parallax: Smoke 0.9088, Wildfire 0.8182; large parallax: Smoke 0.9091, Wildfire 0.8177).
7. Registration-Free Fusion Methodology
Traditional VFDNet required pre-registration of input modalities. IA-VFDnet innovates via internal feature matching (AKM), eliminating geometric registration. Its wavelet-domain fusion (WDAF) refines matched features and denoises, supporting robust detection of small-scale targets (e.g., distant smoke) and accommodating parallax and sensor misalignment. The multi-scale fusion strategy makes IA-VFDnet resilient to modality mismatches and noise without any external alignment module.
8. Discussion and Application Scope
VFDNET’s transformer-based attention mechanisms enable rapid detection of global facial inconsistencies, demonstrated on deepfake image discrimination. The architecture achieves generalization with comparatively fewer parameters and avoids overfitting (val loss 0.0068), in contrast to deep CNNs such as ResNet50. MobileNetV3 provides a lightweight, strong baseline but does not match VFDNET’s performance, suggesting that transformer global attention outperforms depth-wise separable convolutions for these detection tasks.
IA-VFDnet generalizes VFDNET to the challenging domain of multimodal fusion under non-ideal, registration-free conditions. Its modular adaptive matching and denoising fusion pipeline sets new performance standards on registered and unregistered imaging benchmarks.
In summary, VFDNET encapsulates both Vision Transformer–based classification for deepfake detection and, via its hybrid IA-VFDnet extension, registration-free multimodal fusion for challenging IR-VIS detection tasks. These models represent significant advances in attention-based and hybrid architectures for trustworthy, high-performance visual analysis (Urmi et al., 3 Jan 2026, Guan et al., 2023).