Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion

Published 10 Apr 2026 in cs.CV | (2604.09018v1)

Abstract: Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing it to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to tackle partial attacks and overfitting to facial features. Our extensive experiments validate PCGAN's effectiveness in domain generalization and in detecting partial attacks, giving a substantial improvement in facial recognition security.


Authors (6)

Summary

The paper introduces a novel framework leveraging a PCGAN for artifact pattern augmentation and a patch-based multi-tasking network to boost generalization.
Methodologically, it disentangles facial content from spoof artifacts using spatial skip connections and patch-wise adversarial losses.
Experimental results show a significant reduction in ACER and improved AUC, demonstrating robust performance under cross-domain and partial attack scenarios.


Introduction and Problem Context

Face recognition (FR) systems have become ubiquitous in security-critical applications, which makes presentation attacks (PAs) and digital manipulation (e.g., deepfakes) a persistent challenge. The reliability of face anti-spoofing (FAS) algorithms is severely constrained by the limited domain diversity and scale of available datasets. Conventional models degrade markedly under cross-domain and unseen attack conditions, primarily because they overfit to specific devices, environments, and artifact distributions. This work addresses these limitations through two core innovations: a Pattern Conversion GAN (PCGAN) for artifact-level augmentation and a Patch-based Multi-tasking Network (PMN) for fine-grained detection robustness.

Figure 1: Overall framework illustrating (a) disentanglement and conversion of spoof artifact patterns by combining spatial facial content features and artifact representations from different images, (b) PCGAN-based data augmentation and patch dataset generation, (c) Patch-based Multi-tasking Network for joint learning from patched and full face images.

Methodology

Pattern Conversion GAN (PCGAN)

PCGAN explicitly disentangles facial content from spoof artifact patterns: it maps an input image $x \in \mathbb{R}^{3 \times 1024 \times 1024}$ to a content code $z_{con}$ and an artifact code $z_{pat}$. Unlike prior generative approaches (e.g., StyleGAN-based domain adaptation [styleassemble], cycle-consistent transfer [yadav2021cit]), PCGAN (based on modified swapping auto-encoders [SAE]) employs minimal downsampling in the encoder to preserve high-frequency artifact details (e.g., moiré, dot matrix, screen glare). Its main components are:

Generator: Integrates $z_{con}$ via spatially-aware skip connections and injects $z_{pat}$ through AdaIN at each CNN block.
Discriminator/Patch Discriminator: Patch-wise adversarial losses enforce artifact-specific fidelity.
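
As a rough illustration of the AdaIN injection mentioned above, the following NumPy sketch normalizes each channel of a content feature map and then rescales it with a per-channel scale and shift derived from $z_{pat}$. The affine map `style_from_pattern`, its `weight` and `bias`, and all tensor sizes are hypothetical placeholders; the summary does not specify the actual layer shapes or how PCGAN predicts the AdaIN parameters.

```python
import numpy as np

def adain(content_feat, gamma, beta, eps=1e-5):
    """Adaptive instance normalization on a (C, H, W) feature map.

    gamma, beta: per-channel scale and shift of shape (C,). Here they are
    assumed to be predicted from the artifact code z_pat.
    """
    mean = content_feat.mean(axis=(1, 2), keepdims=True)
    std = content_feat.std(axis=(1, 2), keepdims=True)
    normalized = (content_feat - mean) / (std + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]

def style_from_pattern(z_pat, weight, bias):
    """Hypothetical affine map from z_pat (D,) to (gamma, beta), each (C,)."""
    out = weight @ z_pat + bias          # shape (2C,)
    c = out.shape[0] // 2
    return 1.0 + out[:c], out[c:]        # scale centered at 1, shift at 0

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16
feat = rng.normal(size=(C, H, W))        # stand-in for a generator feature map
z_pat = rng.normal(size=D)               # stand-in for an artifact code
weight = rng.normal(scale=0.1, size=(2 * C, D))
bias = np.zeros(2 * C)

gamma, beta = style_from_pattern(z_pat, weight, bias)
styled = adain(feat, gamma, beta)
```

After this operation each channel's mean equals `beta` and its standard deviation is approximately `|gamma|`, so the channel statistics are dictated by the artifact code while the spatial layout of `feat` is preserved.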

PCGAN is trained with joint reconstruction, blurred reconstruction, adversarial, and patch-disentanglement losses. The blurred reconstruction loss is computed on downsampled images, which suppresses high-frequency artifact variance while retaining semantic identity, and thereby enforces content consistency under artifact conversion.
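
To illustrate why a loss computed after downsampling is insensitive to fine artifact patterns, the sketch below compares a plain pixel-wise L1 distance with an L1 distance taken after average pooling. The pooling factor `k`, the use of L1, and the checkerboard stand-in for a high-frequency artifact are assumptions made for this example, not details confirmed by the summary.

```python
import numpy as np

def avg_pool(img, k):
    """Downsample a (C, H, W) image by averaging non-overlapping k x k blocks."""
    c, h, w = img.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    return img.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def blurred_l1(converted, source, k=8):
    """L1 distance between two images after k-fold average pooling."""
    return float(np.abs(avg_pool(converted, k) - avg_pool(source, k)).mean())

rng = np.random.default_rng(0)
face = rng.uniform(size=(3, 64, 64))     # stand-in for a face image

# Zero-mean pixel-level checkerboard as a toy high-frequency artifact.
yy, xx = np.mgrid[0:64, 0:64]
pattern = 0.1 * np.where((yy + xx) % 2 == 0, 1.0, -1.0)
with_artifact = face + pattern           # broadcasts over the channel axis

pixel_l1 = float(np.abs(with_artifact - face).mean())
blur_l1 = blurred_l1(with_artifact, face, k=8)
```

Here `pixel_l1` is 0.1 (the pattern amplitude), while `blur_l1` is zero up to floating-point error because the pattern averages out inside each 8×8 block. A loss of this kind therefore lets the generator change fine artifact texture freely while still penalizing changes to coarse facial content.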

Figure 2: PCGAN can remove (c) or inject (d) artifact patterns, as visualized by Canny and Hough transforms, demonstrating effective artifact-level manipulations across facial samples.

Patch-Based Multi-Tasking Network (PMN)

PMN combines CLIP-based multimodal encoders, MLPs, and multi-task heads. It jointly supervises full-face images and randomly cropped patches to improve robustness against partial or local attacks (e.g., paper edges, display corners):

Patch Cropping: Random regions (scale 0.2–1.0) are sampled, facilitating the learning of multi-scale, spatially-varying artifacts. Figure 3 visualizes the various cropping schemes, highlighting their contribution to spatial coverage and model generalization.
Figure 3: Representative cropping strategies for generating patch datasets, enabling multi-scale, locality-aware training for fine-grained anti-spoofing.
Losses: PMN is trained with CLIP loss (aligning embeddings of text prompts and images), face/patch classification losses, and a center loss to maximize intra-class compactness and improve open-set generalization properties.
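
The patch sampling and the center-loss term described above can be sketched as follows. This is a minimal illustration under stated assumptions: the scale range 0.2–1.0 is interpreted as a fraction of the image area (as in common random-resized-crop augmentations), patches keep the image's aspect ratio, and the center loss uses the standard form $\frac{1}{2N}\sum_i \lVert f_i - c_{y_i} \rVert^2$. The summary does not state which conventions the paper uses, and `random_patch` and `center_loss` are hypothetical names.

```python
import numpy as np

def random_patch(img, rng, scale=(0.2, 1.0)):
    """Crop a random region of an (H, W, C) image.

    The crop covers a fraction of the image area drawn uniformly from `scale`
    (an assumed interpretation of the 0.2-1.0 range).
    """
    h, w = img.shape[:2]
    side = np.sqrt(rng.uniform(*scale))  # per-side fraction for the area fraction
    ph = max(1, int(round(h * side)))
    pw = max(1, int(round(w * side)))
    top = int(rng.integers(0, h - ph + 1))
    left = int(rng.integers(0, w - pw + 1))
    return img[top:top + ph, left:left + pw]

def center_loss(features, labels, centers):
    """Half the mean squared distance of each feature to its class center."""
    diffs = features - centers[labels]
    return 0.5 * float((diffs ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
image = rng.uniform(size=(100, 100, 3))  # stand-in for a face image
patch = random_patch(image, rng)

# Toy 2-D embeddings: class 0 (live) and class 1 (spoof).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 3.0]])
labels = np.array([0, 1, 1])
loss = center_loss(features, labels, centers)
```

In the toy example only the third embedding is off its class center (squared distance 4), so the loss is 0.5 × 4/3 = 2/3. In training, the centers would typically be learned or updated alongside the network rather than fixed.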

Synthesized Data Augmentation and Patch Training

PMN is trained using both original images and PCGAN-augmented images, including live-to-spoof and spoof-to-live conversions. This augmentation bridges both label and domain spaces, ensuring the detector is robust not only to seen artifact variations but also to synthetic, unseen artifacts.
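
One way to picture this augmentation step is as a pairing of content images with artifact donors, where the converted image takes the label of the image that supplied the artifact pattern. The sketch below assumes that labeling rule, pairs every live image with every spoof image, and uses a hypothetical `convert(content, pattern)` callable as a stand-in for a trained PCGAN; none of these specifics are confirmed by the summary.

```python
from dataclasses import dataclass
from typing import Callable, List

LIVE, SPOOF = 0, 1

@dataclass
class Sample:
    image: str   # placeholder for image data
    label: int

def augment(samples: List[Sample],
            convert: Callable[[str, str], str]) -> List[Sample]:
    """Return the original samples plus live-to-spoof and spoof-to-live conversions.

    `convert(content, pattern)` is a hypothetical stand-in for PCGAN: it keeps
    the face content of `content` and applies the artifact pattern of `pattern`.
    """
    live = [s for s in samples if s.label == LIVE]
    spoof = [s for s in samples if s.label == SPOOF]
    out = list(samples)
    for content in live:                 # live face + spoof artifacts -> spoof
        for pattern in spoof:
            out.append(Sample(convert(content.image, pattern.image), SPOOF))
    for content in spoof:                # spoof face + live "pattern" -> live
        for pattern in live:
            out.append(Sample(convert(content.image, pattern.image), LIVE))
    return out

# Dummy converter that only records which images were combined.
def fake_convert(content: str, pattern: str) -> str:
    return f"{content}+pattern({pattern})"

data = [Sample("live_a", LIVE), Sample("live_b", LIVE), Sample("spoof_c", SPOOF)]
augmented = augment(data, fake_convert)
```

With two live images and one spoof image, this produces two live-to-spoof and two spoof-to-live samples in addition to the three originals. A real pipeline would likely subsample pairs rather than enumerate all of them.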

Experimental Results

Benchmark Evaluation and Analysis

The proposed method is evaluated on several established FAS datasets under domain generalization protocols, including OULU-NPU, CASIA-FASD, MSU-MFSD, Replay-Attack, CASIA-SURF, CASIA-SURF CeFA, and WMCA. Key metrics are ACER (Average Classification Error Rate) and AUC.
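
For readers unfamiliar with the metric, ACER is conventionally the mean of the attack presentation classification error rate (APCER, attacks accepted as live) and the bona fide presentation classification error rate (BPCER, live faces rejected as attacks). The sketch below computes it from already-thresholded binary predictions and treats all attacks as a single type, which simplifies the ISO/IEC 30107-3 convention of reporting the worst APCER over attack types. The function name and example data are illustrative only.

```python
def acer(labels, preds):
    """Return (APCER, BPCER, ACER) for binary decisions.

    labels/preds use 1 = attack (spoof) and 0 = bona fide (live).
    """
    attack_preds = [p for y, p in zip(labels, preds) if y == 1]
    live_preds = [p for y, p in zip(labels, preds) if y == 0]
    if not attack_preds or not live_preds:
        raise ValueError("need at least one attack and one bona fide sample")
    apcer = sum(1 for p in attack_preds if p == 0) / len(attack_preds)
    bpcer = sum(1 for p in live_preds if p == 1) / len(live_preds)
    return apcer, bpcer, (apcer + bpcer) / 2

# Four attacks (one missed) and six live samples (one falsely rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
apcer, bpcer, acer_value = acer(y_true, y_pred)
```

In this toy case APCER is 1/4 and BPCER is 1/6, giving an ACER of about 0.208. Because the two error rates are averaged rather than pooled, ACER is not dominated by whichever class has more samples.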

Strong Numerical Claims:

On cross-domain FAS (e.g., OCM→I), PCGAN+PMN achieves an ACER of 3.33%, outperforming leading CLIP-based methods without relying on extra external datasets (e.g., CelebA-Spoof).
In large-scale cross-ethnicity benchmarks (e.g., CSW), the approach yields the best average ACER (10.78%) and AUC (95.58%) among strong baselines (see the CSW results table in the paper).
Even when trained without real live images but only PCGAN-synthesized artifact-free “live” samples, performance remains competitive (average ACER: 5.70%), emphasizing effective semantic preservation in artifact removal.
Figure 4: Grad-CAM visualizations show that the proposed method with both PMN and PCGAN (d) focuses activation on spoof-affected regions, outperforming ResNet50 baselines and PMN-only variants, especially for partial attacks.

Ablation and Component Analysis

Component contribution: Patch-level learning, center loss, and PCGAN-generated synthetic samples are all necessary for optimal generalization. Their synergy outperforms any subset, with full combination yielding average ACER as low as 2.79% in multi-source settings.
Data augmentation: Sole reliance on synthesized images is suboptimal, but combined real and synthetic training demonstrates significant synergistic gains.
Patch size and cropping: Random multi-scale cropping provides the most consistent generalization, as fixed cropping underperforms on certain domains. Figure 3 and related tables confirm the critical nature of spatial diversity in patch sampling.
Figure 5: Example of Sobel-filtered artifacts illustrating how spoof-related high-frequency patterns are extracted and suppressed through artifact disentanglement and conversion.
Qualitative: Visual interpretation via Canny/Hough and Grad-CAM shows effective transfer and removal of attack-specific artifacts, validating the semantic plausibility and physical relevance of pattern conversion.

Computational Cost

PCGAN (109M params, 95G FLOPs) is used only to generate samples during training and adds no cost at inference, where only the CLIP-based detector (86M params, 17.6G FLOPs) runs. This makes the approach computationally efficient relative to diffusion-based augmentation approaches (e.g., AG-FAS).

Implications and Future Directions

The artifact-centric approach of PCGAN, coupled with patch-level multi-task supervision, advances the state of domain-generalizable FAS through:

Explicit artifact disentanglement: Facilitates more realistic and controllable augmentation of artifact patterns, crucial for robustness in the open world where new spoofing mediums and techniques are regularly encountered.
Patch-level reasoning: Provides local supervisory signals for detecting partial, mixed, or subtle spoofs, outperforming global-only or domain-adaptive baselines.
Synergistic augmentation: Combination of real and synthetic data is essential; synthetic-only models do not match the fidelity and diversity required for full generalization.

Potential future directions include: expanding the diversity and sophistication of synthetic artifact patterns (possibly through larger generative models or diffusion frameworks), developing temporal or video-based FAS leveraging artifact-level augmentation, extending patch-based multi-tasking to non-facial biometrics, and investigating cross-modal feature disentanglement.

Conclusion

This work introduces a domain-generalizable face anti-spoofing solution, leveraging artifact-focused data augmentation through PCGAN and patch-based multi-task learning by PMN. The methodology demonstrates compelling improvements in cross-domain generalization, partial attack robustness, and computational efficiency. The explicit focus on artifact pattern manipulation and patch-level supervision addresses both theoretical and practical constraints in FAS. Extending this paradigm to richer generative schemes and broader biometric modalities represents a logical and promising trajectory for future investigation.
