SuHiFiMask: Surveillance PAD Benchmark
- SuHiFiMask is a comprehensive benchmark dataset and evaluation paradigm for face presentation attack detection (PAD) under unconstrained, long-range surveillance conditions.
- The dataset comprises 10,195 video sequences from 101 volunteers across 40 diverse environments, capturing multi-view, low-resolution, and noisy face samples.
- Evaluation protocols employ metrics like ACER, APCER, and BPCER, while advanced methods (e.g., transformer backbones and CQIL) highlight both performance gains and persistent challenges in robust PAD.
SuHiFiMask (short for Surveillance High-Fidelity Mask) defines a benchmark dataset and evaluation paradigm for face presentation attack detection (PAD) under real-world, long-range surveillance conditions. The corpus and associated challenges systematically address the domain gap between traditional face anti-spoofing (FAS) deployments (e.g., phone unlocking) and unconstrained field surveillance, with a focus on performance under severe image quality degradations and diverse spoofing modalities (Fang et al., 2023, Fang et al., 2023).
1. Dataset Composition and Characteristics
SuHiFiMask comprises 10,195 video sequences of 101 volunteers representing multiple age categories, each recorded synchronously by seven commercial surveillance cameras per scene. Recordings were performed across 40 distinct environments—including cafés, cinemas, parking lots, and security-check lanes—and encompass four weather states (sunny, windy, cloudy, snowy) as well as both diurnal and nocturnal lighting conditions. Subjects appear at distances aligning with operational surveillance (greater than 3 m), yielding face crops with pronounced low resolution (often < 100 px width), motion-induced blur, occlusion, and compound noise artifacts (Fang et al., 2023, Fang et al., 2023).
Attack classes and sample statistics:
| Attack Modality | Sample Count | Materials or Presentation |
|---|---|---|
| 3D high-fidelity mask | 232 | Plaster, resin, silicone, head molds |
| 2D attack | 200 | Printed posters, portraits, replay screens |
| Adversarial attack | 2 | Custom adversarial mask, adversarial hat |
Data for each participant event is captured simultaneously by all seven cameras, providing multi-view, multi-resolution samples for each live or spoof action. The dataset is RGB-only; depth and IR channels are absent.
Preprocessing involves frame-level face detection (RetinaFace), short-clip face tracking, frame sampling (every 10th), and bounding-box cropping. Resulting crops are organized hierarchically by group, scene, camera, epoch, and timestamp (Fang et al., 2023).
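The sampling and cropping steps can be illustrated with a minimal sketch. It is not the authors' pipeline: frames are toy nested lists, `detect` is a hypothetical stand-in for a face detector such as RetinaFace, the short-clip tracking stage is omitted, and sampling is applied before detection only to keep the example short.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[List[int]]          # toy grayscale frame: rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels (assumed layout)


def sample_every_nth(frames: Sequence[Frame], step: int = 10) -> List[Tuple[int, Frame]]:
    """Keep frames 0, step, 2*step, ... together with their original index."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return [(i, frames[i]) for i in range(0, len(frames), step)]


def crop(frame: Frame, box: Box) -> Frame:
    """Crop a bounding box, clamping it to the frame borders."""
    height, width = len(frame), len(frame[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return [row[x1:x2] for row in frame[y1:y2]]


def extract_crops(
    frames: Sequence[Frame],
    detect: Callable[[Frame], List[Box]],  # hypothetical detector interface
    step: int = 10,
) -> List[Tuple[int, Frame]]:
    """Return (frame_index, face_crop) pairs for every sampled frame."""
    crops = []
    for index, frame in sample_every_nth(frames, step):
        for box in detect(frame):
            crops.append((index, crop(frame, box)))
    return crops
```

Keeping the original frame index alongside each crop makes it straightforward to file the result under a timestamp-based directory layout like the one described above.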
2. Protocols and Performance Metrics
SuHiFiMask defines three main experimental protocols. Of particular note is Protocol-3, targeting quality robustness:
- Protocol-3 stratifies all subjects and attack types into three splits by SER-FIQ image quality score:
  - Train: 159,063 crops
  - Development: 89,276 crops
  - Test: 161,882 crops
Each split retains the full diversity of people, masks, scenes, and backgrounds, but with distinct image-quality regimes. Image quality, as measured by SER-FIQ, is a significant covariate for detection performance; as the quality score declines, error rates increase.
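A quality-stratified split of this kind can be sketched as follows. The thresholds `low` and `high` are placeholders, not the values used by the benchmark, and the assignment of the highest-quality band to training and the lowest to testing is an assumption made for illustration.

```python
from typing import Dict, List


def split_by_quality(scores: Dict[str, float], low: float, high: float) -> Dict[str, List[str]]:
    """Assign each crop id to a split based on its quality score.

    Assumed convention: score >= high -> train, low <= score < high -> dev,
    score < low -> test. Both thresholds are hypothetical parameters.
    """
    if not low < high:
        raise ValueError("low must be strictly smaller than high")
    splits: Dict[str, List[str]] = {"train": [], "dev": [], "test": []}
    for crop_id, quality in scores.items():
        if quality >= high:
            splits["train"].append(crop_id)
        elif quality >= low:
            splits["dev"].append(crop_id)
        else:
            splits["test"].append(crop_id)
    return splits
```

Because the split depends only on the quality score, every subject, scene, and attack type can appear in all three partitions, which matches the description above.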
Recommended PAD evaluation metrics follow ISO/IEC 30107-3:
- APCER (Attack Presentation Classification Error Rate): fraction of attack samples misclassified as bona-fide.
- BPCER (Bona-fide Presentation Classification Error Rate): fraction of bona-fide samples misclassified as attack.
- ACER (Average Classification Error Rate): mean of APCER and BPCER, used as the official ranking criterion.
- AUC (Area Under ROC Curve) is also reported for robustness.
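The metric definitions above translate directly into code. The sketch below assumes label 1 for bona fide and 0 for attack, scores where higher means "more likely bona fide", and a fixed decision threshold; these conventions are illustrative choices, not part of the benchmark specification.

```python
from typing import Sequence, Tuple


def pad_error_rates(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Return (APCER, BPCER, ACER) as fractions in [0, 1].

    Assumed convention: label 1 = bona fide, 0 = attack; a sample is
    accepted as bona fide when score >= threshold.
    """
    attacks = [s for y, s in zip(labels, scores) if y == 0]
    bona_fide = [s for y, s in zip(labels, scores) if y == 1]
    if not attacks or not bona_fide:
        raise ValueError("both classes must be present")
    apcer = sum(s >= threshold for s in attacks) / len(attacks)    # attacks accepted
    bpcer = sum(s < threshold for s in bona_fide) / len(bona_fide)  # bona fide rejected
    return apcer, bpcer, (apcer + bpcer) / 2


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: probability that a bona fide sample outscores an attack."""
    attacks = [s for y, s in zip(labels, scores) if y == 0]
    bona_fide = [s for y, s in zip(labels, scores) if y == 1]
    if not attacks or not bona_fide:
        raise ValueError("both classes must be present")
    wins = 0.0
    for b in bona_fide:
        for a in attacks:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5  # ties count as half a win
    return wins / (len(bona_fide) * len(attacks))
```

The pairwise AUC is quadratic in the number of samples, which is fine for a small illustration; a rank-based implementation would be preferable at the scale of the Protocol-3 test split.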
Cross-dataset and cross-quality generalization are quantified by HTER (Half Total Error Rate) and AUC (Fang et al., 2023).
3. Challenge Organization and Outcomes
The Surveillance Face Presentation Attack Detection Challenge, built on SuHiFiMask with Protocol-3, was featured at CVPR 2023 (Fang et al., 2023). Of the 180 registered teams, 37 qualified for the final evaluation; all code was independently re-verified, and final model performance was ranked on the hidden test split.
Performance of the top three submissions (Protocol-3 test subset):
| Rank | Team | ACER (%) | APCER (%) | BPCER (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | MateoH | 4.73 | 5.07 | 4.38 | 98.38 |
| 2 | CTEL_AI | 5.56 | 9.20 | 1.91 | 98.21 |
| 3 | horsego | 6.22 | 8.17 | 4.26 | 96.97 |
All leading methods demonstrate robust performance against long-range, low-quality spoofs, though a significant error-rate increase remains for the lowest-quality bin as compared to the training-quality regime.
4. Top-ranked Algorithms and Methodological Advances
- MateoH (1st place):
  - Backbone: ViT-Large.
  - Progressive Training Strategy (PTS): iterative hard-sample mining with established decay to prevent catastrophic forgetting.
  - Dynamic Feature Queue (DFQ): treats negative (“attack”) samples as a dynamic set of clusters, using both a global negative center and a queue of attack features for improved class discrimination. Two logits are computed per input, one from the similarity to the global negative center and one from the maximum similarity over the queue, and the two are concatenated for the cross-entropy loss.
- CTEL_AI (2nd place):
  - Backbone: ViT-Large.
  - Adversarial Domain Generalization: quality-based domain adaptation via a Gradient Reversal Layer (GRL) and domain classifier to enforce feature invariance between image-quality splits. The loss combines live/attack cross-entropy with an adversarial domain loss.
  - Augmentation: staged addition of dropout, blur, fog, posterize, patch shuffle, and other nuisance factors during training.
- horsego (3rd place):
  - Dual-stream spatial-frequency architecture: spatial branch (RGB) and frequency branch (FFT, filtered, then IFFT) with EfficientFormerV2 encoders, features concatenated prior to classification.
  - Optimizer: LION, providing accelerated convergence.
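The two-logit idea behind the Dynamic Feature Queue in the first-place entry can be sketched in a few lines. This is an illustration under stated assumptions rather than the team's implementation: cosine similarity, the FIFO replacement policy, and the names `FeatureQueue` and `dfq_logits` are all hypothetical choices, and any temperature scaling or loss target is left out.

```python
import math
from collections import deque
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class FeatureQueue:
    """Fixed-size FIFO of recent attack features (assumed replacement policy)."""

    def __init__(self, max_size: int) -> None:
        self.items: deque = deque(maxlen=max_size)

    def push(self, feature: Sequence[float]) -> None:
        self.items.append(list(feature))  # oldest entry is dropped when full

    def max_similarity(self, feature: Sequence[float]) -> float:
        if not self.items:
            raise ValueError("queue is empty")
        return max(cosine(feature, stored) for stored in self.items)


def dfq_logits(
    feature: Sequence[float], global_center: Sequence[float], queue: FeatureQueue
) -> List[float]:
    """Concatenate the global-center logit and the queue max-similarity logit."""
    return [cosine(feature, global_center), queue.max_similarity(feature)]
```

The queue lets the most similar recent attack sample influence the decision, while the global center provides a stable reference that does not change as the queue turns over.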
A recurring theme is that large backbones, mostly transformer-based (ViT-Large, Swin) but also the convolutional ConvNeXt, outperform smaller networks, especially in the adverse, low-quality regime.
5. Advances in Robust Long-distance PAD: CQIL and Related Directions
The CQIL (Contrastive Quality-Invariance Learning) framework was introduced to address quality-driven performance degradation (Fang et al., 2023). CQIL is structured as follows:
- Image Quality Variable (IQV) Module: Integrates super-resolution (e.g., ESRGAN) to generate paired high- and low-quality face crops, yielding explicit label assignment for contrastive learning.
- BYOL-style Contrastive Learning Branch: Leverages an online and target encoder for quality variant pairs, penalizing distance in the learned feature space and enforcing invariance.
- Separate Quality Network (SQN): Employs Central Difference Convolutions (CDC) for texture detail, adversarially disentangling quality features using a GRL-equipped classifier to enforce quality-agnostic discrimination.
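The contrastive branch follows the general BYOL recipe, whose two core operations are a normalized regression loss between online and target outputs and an exponential-moving-average update of the target weights. The sketch below shows those generic operations on plain lists; how CQIL wires them to the high- and low-quality crop pairs is not reproduced here, and the function names and the default `tau` are illustrative assumptions.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def byol_loss(online_pred: Sequence[float], target_proj: Sequence[float]) -> float:
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    q = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return sum((a - b) ** 2 for a, b in zip(q, z))


def ema_update(
    target_params: Sequence[float], online_params: Sequence[float], tau: float = 0.99
) -> List[float]:
    """Move target weights toward online weights: t <- tau * t + (1 - tau) * o."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

In a quality-invariance setting, `online_pred` would come from one quality variant of a face crop and `target_proj` from the other, so minimizing the loss pulls the two representations together.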
The total CQIL training loss aggregates cross-entropy, contrastive, adversarial, CDC, and MSE-SR components with empirically tuned balancing parameters.
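One plausible way to write such an aggregate is a weighted sum. The symbols below are assumptions introduced for illustration, not the paper's notation, and the choice to leave the cross-entropy term unweighted is likewise an assumption.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}}
  + \lambda_{1}\,\mathcal{L}_{\text{con}}
  + \lambda_{2}\,\mathcal{L}_{\text{adv}}
  + \lambda_{3}\,\mathcal{L}_{\text{CDC}}
  + \lambda_{4}\,\mathcal{L}_{\text{SR}},
\qquad \lambda_{i} \ge 0
```

Here the terms stand for the cross-entropy, contrastive, adversarial, CDC, and super-resolution MSE components respectively, and each weight is a tunable balancing parameter.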
Empirical results indicate consistent performance gains for CQIL over baseline ResNet18 and CDCN networks under all protocols and quality regimes. For example, under Protocol-3, CQIL achieves ACER of 15.98% compared to ResNet18 at 17.64% and CDCN at 27.30%. t-SNE projections confirm tighter separation of live/attack clusters even under strong degradation, though the method fails under extreme occlusion or negligible face pixel counts (Fang et al., 2023).
6. Limitations and Research Directions
Several open directions and constraints are identified:
- Modalities: Current SuHiFiMask is restricted to RGB. Expansion to depth or IR imaging could introduce additional discriminative cues.
- Model Complexity: Super-resolution integration and dual-branch architectures increase inference cost; lightweight alternatives and compression are required for real-time operation.
- Generalization: There are persistent challenges in transferring trained PAD models to unseen domains (alternative camera hardware, adversarial lighting). Suggested remedies include meta-learning and self-supervised domain adaptation.
- Data Expansion: Increased subject diversity, dynamic backgrounds, and novel adversarial attacks are essential for comprehensive future-proofing.
- Innovative Paths: Integration of modern super-resolution (ESRGAN, BasicVSR, GCFSR), generative spoof synthesis, and systematic studies into large-model interpretability and robustness remain open avenues (Fang et al., 2023).
A plausible implication is that research utilizing SuHiFiMask will continue to shape PAD solutions for unconstrained surveillance environments, especially through joint advances in data realism, architectural robustness, and quality adaptation.
7. Significance and Community Impact
SuHiFiMask provides, for the first time, a foundation for large-scale, realistic evaluation and benchmarking of PAD solutions in operational surveillance conditions. Its combination of synchronized multi-camera capture, adversarial and high-fidelity attack simulation, and explicit image-quality–stratified protocols has shifted the community’s focus toward generalizable and robust anti-spoofing under operational, unconstrained nuisances. Evidence of the dataset’s catalytic impact is seen in accelerated advances in transformer architectures, contrastive representation learning, domain adaptation, and super-resolution–augmented pipelines for long-range FAS (Fang et al., 2023, Fang et al., 2023).