Central Difference Convolutional Networks
- Central Difference Convolutional Networks (CDCN) are deep models that combine learnable intensity and central-difference gradient information to detect fine-grained spoofing cues.
- The CDC operation blends standard convolution with a central difference term weighted by a mixing coefficient θ, balancing gradient and intensity features without extra parameters.
- Variants like CDCN++ and DC-CDN improve efficiency and generalization, achieving state-of-the-art results in face anti-spoofing benchmarks with minimal parameter overhead.
Central Difference Convolutional Networks (CDCN) are a class of deep architectures that generalize standard convolutional networks by integrating a learnable mixture of local intensity and central-difference gradient information at each spatial location. Initially developed for frame-level face anti-spoofing—specifically, liveness detection—they address deficiencies of stacked vanilla convolutional backbones, particularly in modeling fine-grained spoofing cues and robustness to domain variation. CDCN and its successors (e.g., CDCN++, DC-CDN) have demonstrated substantial improvements across intra-dataset, cross-dataset, cross-type, and multi-modal face anti-spoofing benchmarks, establishing new state-of-the-art accuracy with minimal parameter overhead and favorable trade-offs in efficiency and representation (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021).
1. Mathematical Definition and Operator Design
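The central operation underlying CDCN is the Central Difference Convolution (CDC). Let $x$ be an input feature map and $w$ a kernel with receptive field $\mathcal{R}$ about spatial center $p_0$. The output at $p_0$ is:
$$y(p_0) = \theta \sum_{p_n \in \mathcal{R}} w(p_n)\,\bigl(x(p_0 + p_n) - x(p_0)\bigr) + (1 - \theta) \sum_{p_n \in \mathcal{R}} w(p_n)\,x(p_0 + p_n)$$
where $\theta \in [0, 1]$. The parameter $\theta$ balances gradient-oriented (central difference) aggregation and conventional intensity aggregation. $\theta = 0$ recovers standard convolution; $\theta = 1$ yields gradient-only aggregation. Optimal values are dataset- and task-dependent and are selected empirically for FAS.
Expanding the first sum and collecting terms gives an efficient implementation that expresses the CDC as a vanilla convolution minus a per-pixel scaled input term:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\,x(p_0 + p_n) - \theta\,x(p_0) \sum_{p_n \in \mathcal{R}} w(p_n)$$
No extra learnable parameters are introduced (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021).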
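The equivalence between the CDC definition and its efficient form can be checked numerically. The following is a minimal, dependency-free sketch for a single-channel 2D input with a 3×3 kernel and zero padding; the function names (`conv3x3`, `cdc_direct`, `cdc_efficient`) are illustrative and not taken from the cited papers. In a deep learning framework the same idea would typically be a standard convolution followed by subtracting θ times a 1×1 convolution whose weight is the spatial sum of the kernel.

```python
def conv3x3(x, w):
    """Vanilla 3x3 cross-correlation with zero padding on a 2D list of floats."""
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        out[i][j] += w[di + 1][dj + 1] * x[ii][jj]
    return out


def cdc_direct(x, w, theta):
    """CDC straight from the definition (zero-padded neighbours count as 0)."""
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            grad = inten = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    inside = 0 <= ii < rows and 0 <= jj < cols
                    x_n = x[ii][jj] if inside else 0.0
                    grad += w[di + 1][dj + 1] * (x_n - x[i][j])   # central difference term
                    inten += w[di + 1][dj + 1] * x_n              # intensity term
            out[i][j] = theta * grad + (1.0 - theta) * inten
    return out


def cdc_efficient(x, w, theta):
    """Same result as cdc_direct: vanilla conv minus theta * x(p0) * sum(w)."""
    w_sum = sum(sum(row) for row in w)
    vanilla = conv3x3(x, w)
    return [[v - theta * x_ij * w_sum for v, x_ij in zip(v_row, x_row)]
            for v_row, x_row in zip(vanilla, x)]
```

Setting `theta=0.0` in either function reduces to `conv3x3`, matching the statement that θ = 0 recovers standard convolution.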
2. Architectural Instantiations: CDCN, CDCN++, and Extensions
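The original CDCN replaces all $3 \times 3$ convolutions in the DepthNet-style FAS backbone (see Liu et al. 2018) with CDC layers, all using $\theta = 0.7$. The canonical CDCN pipeline is:
- Input: RGB face frame, $256 \times 256 \times 3$
- Stem: $3 \times 3$ CDC (64 channels) → BatchNorm/ReLU
- Low/Mid/High Stages: Each comprises stacked $3 \times 3$ CDC blocks with increasing channels and $3 \times 3$ max-pooling (stride 2); outputs at $128 \times 128$, $64 \times 64$, and $32 \times 32$.
- Feature Fusion: Low, Mid, and High features are resized to a common $32 \times 32$ resolution and concatenated channel-wise to form a $384 \times 32 \times 32$ tensor.
- Head: Three $3 \times 3$ CDC layers output a $32 \times 32$ depth map.
- Loss: $\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{CDL}}$ (mean squared error plus contrastive depth loss)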
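The pipeline above can be followed as a sequence of tensor shapes. The sketch below only does shape arithmetic, with no actual convolution. It assumes a 3×256×256 input, a 64-channel stem, 128 output channels per stage, 3×3 max-pooling with stride 2 and padding 1, and resizing of all stage outputs to 32×32 before fusion. These values and the function names are assumptions made for illustration and should be checked against the cited papers.

```python
def cdc_same(shape, out_channels):
    """A 3x3 CDC layer with stride 1 and padding 1 keeps height and width."""
    _, h, w = shape
    return (out_channels, h, w)


def max_pool(shape, kernel=3, stride=2, padding=1):
    """Output shape of max-pooling using the usual floor formula."""
    c, h, w = shape

    def size(n):
        return (n + 2 * padding - kernel) // stride + 1

    return (c, size(h), size(w))


def trace_cdcn(input_shape=(3, 256, 256), stage_channels=128, fuse_size=32):
    """Return (stage shapes, fused shape, depth-map shape) for the assumed layout."""
    x = cdc_same(input_shape, 64)                     # stem
    stages = []
    for _ in range(3):                                # low, mid, high
        x = max_pool(cdc_same(x, stage_channels))
        stages.append(x)
    # Assumed: every stage output is resized to fuse_size before concatenation.
    fused = (sum(s[0] for s in stages), fuse_size, fuse_size)
    depth = cdc_same(fused, 1)                        # head ends in one channel
    return stages, fused, depth
```

Under these assumptions the three stages produce 128×128, 64×64, and 32×32 maps, and the fused tensor has 3 × 128 = 384 channels.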
CDCN++ employs neural architecture search (NAS) over a specialized CDC-based search space. The search space allows variable kernel sizes, expansion ratios, and “CDC_2@r” operations (two stacked CDC layers with channel expansion and projection). A DARTS/PC-DARTS-like gradient-based bi-level optimization is adopted. The derived backbone exhibits cell-wise heterogeneity and tailored connectivity, and achieves superior accuracy (e.g., 0.2% ACER on OULU-NPU Protocol 1) with reduced parameter count (Yu et al., 2020, Yu et al., 2020).
DC-CDN, introduced by Yu et al. (2021), further decouples CDC into two sparse "cross" variants (C-CDC): Horizontal-Vertical (HV) and Diagonal (DG), forming dual parallel backbones. Cross Feature Interaction Modules (CFIM) adaptively fuse features from the HV and DG streams at each scale. This yields favorable efficiency–accuracy trade-offs: DC-CDN reduces both parameter count and FLOPs compared to standard CDCN while often surpassing it in accuracy.
3. Multiscale Attention and Multi-Modal Fusion
Robust face anti-spoofing requires explicit multi-scale and possibly multi-modal feature integration. CDCN++ introduces the Multiscale Attention Fusion Module (MAFM). For each scale (low, mid, high), spatial attention masks are produced by:
- Channel-wise average and max pooling, concatenation
- Depthwise convolution with large kernels ($7 \times 7$, $5 \times 5$, and $3 \times 3$ for low/mid/high, respectively)
- Sigmoid activation to generate spatial masks
- Elementwise masking of the feature maps
Refined features are then concatenated and passed through the CDCN++ head (Yu et al., 2020).
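The four attention steps listed above can be sketched for a single scale as follows. This is an illustrative, dependency-free version operating on nested lists shaped `[C][H][W]`. The function name `spatial_attention` and the way the two pooled maps are combined (one k×k kernel per pooled map, summed into a single mask) are assumptions and not a transcription of the CDCN++ implementation.

```python
import math


def spatial_attention(feat, kernel):
    """Apply a spatial attention mask to feat ([C][H][W]).

    kernel has shape [2][k][k]: one k x k filter for the channel-average map
    and one for the channel-max map (assumed wiring). Zero padding is used.
    Returns (refined features, mask).
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    k = len(kernel[0])
    r = k // 2
    # Step 1: channel-wise average and max pooling.
    avg = [[sum(feat[ch][i][j] for ch in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(feat[ch][i][j] for ch in range(c)) for j in range(w)]
          for i in range(h)]
    pooled = (avg, mx)
    mask = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            # Step 2: large-kernel convolution over the pooled maps.
            for m in range(2):
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[m][di + r][dj + r] * pooled[m][ii][jj]
            # Step 3: sigmoid turns the response into a mask value in (0, 1).
            mask[i][j] = 1.0 / (1.0 + math.exp(-acc))
    # Step 4: element-wise masking, broadcast over channels.
    refined = [[[feat[ch][i][j] * mask[i][j] for j in range(w)]
                for i in range(h)] for ch in range(c)]
    return refined, mask
```

With an all-zero kernel the mask is 0.5 everywhere, so every feature value is simply halved; a trained kernel would instead emphasize some spatial locations over others.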
For multi-modal FAS (e.g., RGB+depth+IR), Yu et al. (2020) employ a CDCN branch per modality and concatenate the branches' multi-scale outputs channel-wise before head fusion. Single-modal baselines are compared with feature-level, input-level, and score-level fusion strategies, with feature- and score-level fusion providing the best generalization.
4. Cross Central Difference Convolution (C-CDC) and Dual-Cross Networks
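Full CDC aggregates gradients over all eight neighborhood directions, but Yu et al. (2021) observe redundancy and potential optimization inefficiency. By restricting CDC to "cross" neighborhoods ($\mathcal{S}_{HV}$ or $\mathcal{S}_{DG}$, each of size 5 vs. 9 for the full $3 \times 3$ neighborhood), Cross Central Difference Convolution (C-CDC) is defined:
$$y(p_0) = \theta \sum_{p_n \in \mathcal{S}} w(p_n)\,\bigl(x(p_0 + p_n) - x(p_0)\bigr) + (1 - \theta) \sum_{p_n \in \mathcal{S}} w(p_n)\,x(p_0 + p_n), \qquad \mathcal{S} \in \{\mathcal{S}_{HV}, \mathcal{S}_{DG}\}$$
Here $x$ is the input feature map, $w$ the kernel, $p_0$ the spatial center, and $\theta$ the same mixing coefficient as in CDC.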
Empirically, C-CDC with either cross pattern improves ACER over full CDC while using fewer parameters and roughly a third of the FLOPs (Yu et al., 2021, Table 1).
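One way to picture C-CDC is as a CDC whose 3×3 kernel is masked to a cross-shaped support. The sketch below, for a single-channel 2D input with zero padding, uses a plus-shaped mask for HV and an X-shaped mask for DG (center included in both, giving 5 active weights each). The names `HV_MASK`, `DG_MASK`, and `c_cdc` are illustrative assumptions.

```python
HV_MASK = [[0, 1, 0],
           [1, 1, 1],
           [0, 1, 0]]   # horizontal-vertical cross
DG_MASK = [[1, 0, 1],
           [0, 1, 0],
           [1, 0, 1]]   # diagonal cross


def c_cdc(x, w, theta, mask):
    """CDC restricted to the neighbours selected by mask (efficient form)."""
    rows, cols = len(x), len(x[0])
    # Zero out kernel weights that fall outside the cross-shaped support.
    wm = [[w[a][b] * mask[a][b] for b in range(3)] for a in range(3)]
    w_sum = sum(sum(row) for row in wm)
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        acc += wm[di + 1][dj + 1] * x[ii][jj]
            # Vanilla (masked) convolution minus the scaled centre term.
            out[i][j] = acc - theta * x[i][j] * w_sum
    return out
```

With `theta=1.0` the output at an interior pixel of a constant region is zero, since only differences to the center remain; the two masks respond to different neighbors, which is what the dual-stream design exploits.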
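The Dual-Cross Central Difference Network (DC-CDN) fuses parallel C-CDC(HV) and C-CDC(DG) streams with CFIM:
$$F'_{HV} = \sigma(\alpha)\,F_{HV} + \bigl(1 - \sigma(\alpha)\bigr)\,F_{DG}, \qquad F'_{DG} = \sigma(\beta)\,F_{DG} + \bigl(1 - \sigma(\beta)\bigr)\,F_{HV}$$
where $F_{HV}$ and $F_{DG}$ are the features of the two streams at a given scale, $\sigma$ is the sigmoid, and $\alpha$, $\beta$ are learnable, per-scale scalars. This configuration achieves state-of-the-art accuracy and efficiency across multiple FAS benchmarks (Yu et al., 2021).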
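A sigmoid-gated fusion of this kind can be sketched in a few lines. The version below treats each stream's features as a flat list of floats and mixes them with two scalar gates. The convex-combination form and the names `cfim`, `alpha`, and `beta` are assumptions made for illustration, so the exact formulation should be checked against Yu et al. (2021).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cfim(f_hv, f_dg, alpha, beta):
    """Mix two feature streams with learnable scalar gates (assumed form).

    Each output is a convex combination of its own stream and the other one,
    with mixing weights sigmoid(alpha) and sigmoid(beta).
    """
    a, b = sigmoid(alpha), sigmoid(beta)
    new_hv = [a * h + (1.0 - a) * d for h, d in zip(f_hv, f_dg)]
    new_dg = [b * d + (1.0 - b) * h for h, d in zip(f_hv, f_dg)]
    return new_hv, new_dg
```

With both scalars at 0 the gates equal 0.5 and each stream becomes the average of the two; large positive values leave each stream essentially unchanged, so the amount of cross-talk is learned per scale.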
5. Training Protocols, Ablations, and Results
Datasets and Protocols:
- OULU-NPU (intra-dataset, four protocols; APCER, BPCER, ACER)
- SiW and SiW-M (cross-type, leave-one-out; ACER, EER)
- CASIA-MFSD, Replay-Attack, MSU-MFSD (cross-dataset/type; HTER, AUC)
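The error rates named in these protocols are simple ratios at a fixed decision threshold. The sketch below computes APCER (attacks accepted as live), BPCER (live samples rejected as attacks), and ACER (their mean). It is a simplified version that pools all attack types together, whereas protocol-level APCER is commonly reported as the worst case over attack types. The function name and the label convention (1 = live, 0 = spoof) are assumptions for illustration.

```python
def fas_metrics(scores, labels, threshold):
    """Compute APCER, BPCER, and ACER at a given threshold.

    scores: liveness scores, higher means more likely live.
    labels: 1 for live (bona fide), 0 for spoof (attack).
    A sample is predicted live when its score is >= threshold.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 0]
    lives = [s for s, y in zip(scores, labels) if y == 1]
    apcer = sum(s >= threshold for s in attacks) / len(attacks)
    bpcer = sum(s < threshold for s in lives) / len(lives)
    return {"APCER": apcer, "BPCER": bpcer, "ACER": (apcer + bpcer) / 2.0}
```

HTER used in cross-dataset testing has the same shape: it is the mean of the false acceptance and false rejection rates, with the threshold typically fixed on a development set rather than on the test data.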
Implementation Details:
- PRNet-derived pseudo-depth maps supervise live faces; spoof faces are supervised with all-zero depth maps
- Adam optimizer with weight decay (learning rate and decay as specified in the cited papers)
- Loss: MSE plus contrastive depth loss (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021)
- Test score: mean predicted depth
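The loss and test score in the list above can be sketched on small 2D depth maps. The contrastive depth term here compares, for each of the eight neighbor directions, the difference between a neighbor and the center pixel in the prediction against the same difference in the target. Border pixels are skipped for simplicity, and the two terms are summed with equal weight. These choices and all function names are assumptions made for illustration.

```python
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]


def flatten(depth):
    return [v for row in depth for v in row]


def mean_squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def contrast_responses(depth):
    """Neighbour-minus-centre differences for all 8 directions (interior only)."""
    rows, cols = len(depth), len(depth[0])
    return [depth[i + di][j + dj] - depth[i][j]
            for di, dj in NEIGHBOR_OFFSETS
            for i in range(1, rows - 1)
            for j in range(1, cols - 1)]


def depth_loss(pred, target):
    """MSE on raw depth plus MSE on contrastive (local difference) responses."""
    mse = mean_squared_error(flatten(pred), flatten(target))
    cdl = mean_squared_error(contrast_responses(pred),
                             contrast_responses(target))
    return mse + cdl


def liveness_score(pred):
    """Test-time score: mean of the predicted depth map."""
    flat = flatten(pred)
    return sum(flat) / len(flat)
```

Because spoof targets are flat, a prediction with strong local depth structure on a spoof sample is penalized by both terms, while its mean depth directly lowers or raises the liveness score.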
Key Ablations:
- A $\theta$ sweep reveals a clear optimum in ACER on OULU-NPU Protocol 1, with the best setting clearly outperforming the alternatives (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021)
- Against vanilla, LBC, and Gabor convs, CDC achieves lowest error on FAS.
- MAFM and CFIM attention modules further improve generalization.
- Patch Exchange (PE) augmentation in (Yu et al., 2021), involving random region swaps between images with label consistency, improves cross-domain and intra-domain metrics.
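The Patch Exchange idea from the last ablation item can be illustrated with a toy example: a random square region is swapped between two images, and the same region is swapped in their pixel-wise labels so that each patch keeps its own supervision. The sketch assumes images and labels are 2D lists of the same resolution (in practice the label map would need to be rescaled to match). The function name `patch_exchange` and its signature are hypothetical.

```python
import copy
import random


def patch_exchange(img_a, lab_a, img_b, lab_b, patch, rng):
    """Swap one random patch x patch region between two samples and their labels.

    Inputs are left untouched; new (img_a, lab_a, img_b, lab_b) are returned.
    rng is a random.Random instance so the augmentation is reproducible.
    """
    h, w = len(img_a), len(img_a[0])
    top = rng.randrange(h - patch + 1)
    left = rng.randrange(w - patch + 1)
    new_img_a, new_lab_a, new_img_b, new_lab_b = (
        copy.deepcopy(m) for m in (img_a, lab_a, img_b, lab_b))
    for i in range(top, top + patch):
        for j in range(left, left + patch):
            # Pixels and labels move together, keeping supervision consistent.
            new_img_a[i][j], new_img_b[i][j] = img_b[i][j], img_a[i][j]
            new_lab_a[i][j], new_lab_b[i][j] = lab_b[i][j], lab_a[i][j]
    return new_img_a, new_lab_a, new_img_b, new_lab_b
```

A typical call would pair a live and a spoof sample, for example `patch_exchange(live_img, live_depth, spoof_img, spoof_depth, 2, random.Random(0))`, where the four inputs are placeholders for user-provided arrays.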
Best performance examples:
| Method | Dataset/Protocol | ACER/EER/HTER |
|---|---|---|
| CDCN++ | OULU-NPU, Protocol-1 | 0.2% ACER |
| CDCN++ | SiW, Protocols 1–3 | (0.12%, 0.04%, 1.90%) ACER |
| CDCN++ | CASIA→Replay (cross) | 6.5% HTER |
| CDCN++ | SiW-M (13-way, cross-type) | 12.7% ACER, 11.9% EER |
| DC-CDN | OULU-NPU, Protocol-1 (+PE) | 0.4% ACER |
| DC-CDN | CASIA→Replay (cross) | 6.0% HTER |
| CDCN (multi-modal) | CASIA-SURF CeFA (MModal) | 1.02% ACER |
Efficiency:
- Relative to CDCN, both C-CDN and DC-CDN reduce parameter count and FLOPs (Yu et al., 2021).
6. Significance, Generalization, and Visualization
CDCN and its derivatives provide strong domain robustness, evident from consistent validation–test metrics under domain shift (e.g., illumination, print/replay/3D-mask attacks, cross-ethnicity scenarios). t-SNE visualizations demonstrate that CDC-based feature representations more clearly separate live from spoof samples than vanilla CNNs.
The multi-modal extension shows that depth consistently outperforms RGB and IR under most protocols; fusion models further raise the ceiling. Feature map analyses reveal that CDC layers and MAFM/CFIM modules enhance attention toward spoof-relevant artifacts. A plausible implication is that CDC and its efficient variants capture local micro-patterns critical to detecting presentation attacks, which standard convolution and hand-crafted feature analogs insufficiently address.
7. Related Methods and Future Directions
Central Difference Convolution has inspired a suite of efficient, parameter-sensitive, and robust backbone designs for fine-grained pattern analysis. Recent work, including C-CDC and dual-stream fusion, pushes CDC's trade-off frontier further, making it a building block not just for face anti-spoofing but for broader pixel-wise classification and detection tasks sensitive to microstructure and domain variability (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021). Refinements such as plug-and-play augmentations (PE) and cross-modal/multi-scale attention provide general templates for extending CDCN to new datasets and modalities.