Real-Centered Detection Network (RCDN)
- The paper introduces a dual-branch CNN that fuses spatial and frequency features to robustly distinguish authentic faces from forgeries.
- It leverages a real-centered loss formulation alongside classification and separation losses to enforce a compact real-face embedding and push forgeries away.
- Empirical evaluations on the DiFF dataset show that RCDN achieves state-of-the-art in-domain accuracy and superior cross-domain stability compared to traditional detectors.
The Real-Centered Detection Network (RCDN) is a dual-branch convolutional neural network designed for robust face forgery identification in scenarios characterized by rapidly evolving generative methods. RCDN anchors the feature representation around authentic facial images (the "real-center"), pushing all forgeries away from this center in the embedding space, thereby achieving strong generalization to unseen forged image distributions. The framework combines frequency and spatial feature extraction, a cross-domain-optimized network geometry, and a real-centered loss formulation to address the generalization gap inherent in traditional CNN-based forgery detectors (McCurdy et al., 17 Jan 2026).
1. Motivation and Core Principles
The emergence of advanced facial synthesis pipelines—ranging from diffusion-based to GAN-based forgery tools—has led to a proliferation of highly realistic fake images, presenting major challenges for automated forgery detectors. Conventional detectors (e.g., Xception, EfficientNet, ResNet+CBAM) achieve over 98% in-domain accuracy when trained and tested on the same forgery category but exhibit an 8–10 point degradation on cross-domain forgeries. This stems from overfitting to the idiosyncrasies of specific fake data distributions and an inability to accommodate distribution shifts as new forgery pipelines appear.
RCDN is predicated on the observation that while distributions of synthesized fakes are diverse and dynamic, the distribution of real face images is comparatively stable. Instead of modeling all conceivable forgery patterns, RCDN selectively anchors its representation around real facial images and enforces that forgeries are distinct, regardless of their generation method. This paradigm shift targets the core requirement for practical, future-proof defenses against forgery: robust, cross-domain identification.
2. Architecture and Feature Extraction
RCDN utilizes a dual-branch architecture consisting of spatial and frequency branches, followed by feature fusion, projection, and dual-head supervision:
- Spatial Branch: Processes the RGB face input via a pre-pooling Xception backbone, outputting a $2048$-dimensional vector $f_s$. This branch prioritizes semantic and structural cues, such as consistency of facial parts and identity.
- Frequency Branch: Applies a 2D FFT to the input, re-centers the frequency spectrum, extracts the magnitude, and applies log-compression with channel-wise standardization to isolate subtle spectral artifacts salient in fakes (e.g., high-frequency noise, diffusion inconsistencies). A lightweight ConvNet processes these features, yielding a frequency vector $f_f$.
- Feature Fusion and Projection: Concatenates the spatial and frequency vectors ($[f_s; f_f]$), projects to a $128$-dimensional embedding via a multilayer perceptron, and $\ell_2$-normalizes the output $z$.
- Supervision Heads:
  - Classification Head: A linear layer maps $z$ to class logits, optimized with cross-entropy.
  - Real-Centered Head: A learnable vector $c$ defines the "real-center." Real samples are pulled toward $c$, while fakes are penalized if they lie within a margin $m$ of $c$.
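The branch preprocessing and fusion steps above can be sketched in a few lines. This is a minimal NumPy illustration under assumed shapes, not the paper's implementation; the projection MLP is stood in for by a single matrix `W`:

```python
import numpy as np

def frequency_features(img):
    """Log-magnitude spectrum with channel-wise standardization.

    img: float array of shape (H, W, C). Returns an array of the
    same shape, ready for a lightweight ConvNet.
    """
    spec = np.fft.fft2(img, axes=(0, 1))       # 2D FFT per channel
    spec = np.fft.fftshift(spec, axes=(0, 1))  # re-center low frequencies
    mag = np.log1p(np.abs(spec))               # log-compressed magnitude
    mu = mag.mean(axis=(0, 1), keepdims=True)  # channel-wise standardization
    sigma = mag.std(axis=(0, 1), keepdims=True) + 1e-8
    return (mag - mu) / sigma

def fuse_and_normalize(f_s, f_f, W):
    """Concatenate branch vectors, project, and L2-normalize to get z."""
    z = W @ np.concatenate([f_s, f_f])         # stand-in for the MLP projection
    return z / (np.linalg.norm(z) + 1e-8)
```

The unit-norm constraint on `z` is what makes the distance-based losses in the next section well-behaved, since all embeddings live on a common hypersphere.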
3. Loss Functions and Embedding Geometry
RCDN introduces a hybrid loss combining classification, center, and separation objectives:
- Classification Loss: Standard cross-entropy $\mathcal{L}_{\mathrm{cls}}$ on the predicted logits for the real/fake binary label.
- Center Loss: Enforces compactness of real embeddings near the real-center $c$ and margins fake embeddings away:

$$\mathcal{L}_{\mathrm{center}} = \frac{1}{|\mathcal{R}|}\sum_{i \in \mathcal{R}} \lVert z_i - c \rVert_2^2 \;+\; \frac{1}{|\mathcal{F}|}\sum_{i \in \mathcal{F}} \max\big(0,\; m - \lVert z_i - c \rVert_2\big)^2$$

- Separation Loss: Ensures the mean distance of fakes to $c$ exceeds that of reals by a margin $m_{\mathrm{sep}}$ on a batch level:

$$\mathcal{L}_{\mathrm{sep}} = \max\Big(0,\; m_{\mathrm{sep}} + \frac{1}{|\mathcal{R}|}\sum_{i \in \mathcal{R}} \lVert z_i - c \rVert_2 \;-\; \frac{1}{|\mathcal{F}|}\sum_{i \in \mathcal{F}} \lVert z_i - c \rVert_2\Big)$$

The final objective is

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{center}}\,\mathcal{L}_{\mathrm{center}} + \lambda_{\mathrm{sep}}\,\mathcal{L}_{\mathrm{sep}},$$

where $\lambda_{\mathrm{center}}$ and $\lambda_{\mathrm{sep}}$ are hyperparameters tuned via validation, and $\mathcal{R}$, $\mathcal{F}$ denote the real and fake samples in a batch.
Embedding normalization is critical for stable distance-based optimization and geometry enforcement.
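A minimal sketch of the center and separation terms on $\ell_2$-normalized embeddings follows; the margin values are illustrative assumptions, and the exact functional forms in the paper may differ:

```python
import numpy as np

def real_centered_losses(z, y, c, m_center=0.5, m_sep=0.2):
    """Center and separation losses on L2-normalized embeddings.

    z: (N, D) embeddings, y: (N,) labels with 1 = real, 0 = fake,
    c: (D,) learnable real-center. Margins are illustrative values.
    """
    d = np.linalg.norm(z - c, axis=1)          # distance of each sample to c
    d_real, d_fake = d[y == 1], d[y == 0]
    # Pull reals toward c; penalize fakes that fall inside the margin.
    l_center = (d_real ** 2).mean() + (np.maximum(0.0, m_center - d_fake) ** 2).mean()
    # Mean fake distance must exceed mean real distance by m_sep.
    l_sep = max(0.0, m_sep + d_real.mean() - d_fake.mean())
    return l_center, l_sep
```

When reals sit on the center and fakes lie beyond both margins, both terms vanish, leaving only the classification loss to drive training.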
4. Empirical Evaluation
Comprehensive experimental validation is conducted on the DiFF dataset, comprising over 500,000 forgeries from 13 diffusion models, with diverse text and visual prompts, and three core forgery categories: Face Editing (FE), Image-to-Image (I2I), and Text-to-Image (T2I). Training and evaluation protocols include both in-domain and cross-domain settings.
Performance Metrics
- In-domain Accuracy: All leading baselines exceed 98%. RCDN achieves state-of-the-art:
- FE: 0.9995
- I2I: 0.9975
- T2I: 0.9990
- Average: 0.9987
- Cross-domain Accuracy (average off-diagonal, i.e., train on one category, test on the others):
| Method | Cross Average |
|---|---|
| Xception | 0.8970 |
| EfficientNet | 0.9075 |
| ResNet+CBAM | 0.9015 |
| DIRE | 0.9048 |
| RCDN | 0.9369 |
- Cross-/In-domain Stability Ratio and Gap:
| Method | In-domain | Cross Avg | Gap | Ratio |
|---|---|---|---|---|
| Xception | 0.9887 | 0.8970 | 0.0917 | 0.907 |
| EfficientNet | 0.9930 | 0.9075 | 0.0855 | 0.914 |
| ResNet+CBAM | 0.9837 | 0.9015 | 0.0822 | 0.916 |
| DIRE | 0.9775 | 0.9048 | 0.0727 | 0.926 |
| RCDN | 0.9987 | 0.9369 | 0.0618 | 0.938 |
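The gap and ratio columns follow directly from the two accuracy columns (gap = in-domain − cross average; ratio = cross average / in-domain) and can be checked mechanically:

```python
# Accuracy figures copied from the tables above.
results = {
    "Xception":     (0.9887, 0.8970),
    "EfficientNet": (0.9930, 0.9075),
    "ResNet+CBAM":  (0.9837, 0.9015),
    "DIRE":         (0.9775, 0.9048),
    "RCDN":         (0.9987, 0.9369),
}
for name, (in_dom, cross) in results.items():
    gap = in_dom - cross      # absolute cross-domain degradation
    ratio = cross / in_dom    # stability ratio (closer to 1 is better)
    print(f"{name}: gap={gap:.4f} ratio={ratio:.3f}")
```

Note that RCDN leads on both measures simultaneously: the smallest gap and the highest ratio.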
Ablation studies reveal that removing either the frequency branch or the real-centered objectives consistently degrades cross-domain robustness by 3–4 points and reduces the stability ratio from 0.938 to approximately 0.92, demonstrating that both components are necessary.
5. Model Training and Implementation Details
For each forgery category, 10,000 training and 2,000 testing images are sampled from DiFF. Preprocessing includes face cropping and resizing for the RGB input, and the frequency transformation for the auxiliary branch. Both branches are optimized jointly in an end-to-end fashion, using Adam with weight decay and learning-rate scheduling. The margins and loss-balancing weights are set through validation-specific tuning.
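The per-category sampling protocol above can be sketched as follows; the function name and ID handling are assumptions for illustration, not from the paper:

```python
import random

def sample_split(image_ids, n_train=10_000, n_test=2_000, seed=0):
    """Draw disjoint train/test samples from one forgery category's image IDs."""
    rng = random.Random(seed)                  # fixed seed for reproducibility
    picked = rng.sample(image_ids, n_train + n_test)  # without replacement
    return picked[:n_train], picked[n_train:]
```

Sampling without replacement keeps the 10,000-image training set and 2,000-image test set disjoint within each category, so in-domain accuracy is measured on unseen images.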
This suggests that hyperparameter selection is moderately sensitive and should be dataset-specific. A plausible implication is that further work is required to automate or stabilize this process, particularly as the real data distribution evolves.
6. Analysis, Limitations, and Future Directions
RCDN shifts the detection paradigm from enumerating fake patterns to defining a geometrically stable "real island" in feature space. Empirical results confirm that authentic faces consistently form a compact, high-density cluster, while all known tested forgeries, irrespective of synthesis method, are mapped outside this cluster. This yields higher resilience to unseen generation pipelines.
Limitations include the need for manual, per-dataset tuning of the margins and loss weights, and potential drift of the real-center if the authentic image domain changes significantly (e.g., with variations in pose, lighting, or source distributions). Future extensions may involve dynamic margin scheduling, self-supervised pretraining regimes, and the integration of video deepfake detection via adapter modules.
7. Significance and Practical Implications
RCDN introduces a real-centered, dual-branch CNN that achieves state-of-the-art in-domain performance (≥99.8%), while reducing the generalization gap on cross-domain face forgery detection to 6.2 points (versus 7.3–10.0 for baselines) and producing the highest cross/in-domain stability ratio recorded (0.938) (McCurdy et al., 17 Jan 2026). Its design enables practical deployment as a front-line defense system, since only authentic examples require comprehensive representation.
Key practical implications include:
- Enhanced robustness to future and unseen facial forgery pipelines, including next-generation GANs and diffusion-based models.
- Only real images need to be comprehensively represented; there is no need to enumerate all possible forgery methods.
- Open-source availability facilitating reproducibility and further research.
By anchoring detection around the statistical invariance of real faces, RCDN provides a scalable, domain-robust solution to cross-domain face forgery identification.