
Central Difference Convolutional Network (CDCN)

Updated 14 March 2026
  • CDCN is a convolutional network that integrates both intensity aggregation and local gradient differences to enhance face anti-spoofing.
  • It employs a U-Net-like design with hierarchical feature extraction and a multiscale attention fusion module to produce accurate liveness masks.
  • Multi-modal extensions with RGB, depth, and IR inputs achieve state-of-the-art results, with an optimal theta of 0.7 providing significant ACER improvements.

Central Difference Convolutional Network (CDCN) is a convolutional neural network architecture designed for face anti-spoofing tasks, distinctively characterized by the central difference convolution (CDC) operator. By explicitly combining intensity and local gradient cues, CDCN is capable of capturing fine-grained textural structures and achieving enhanced robustness to domain shifts, such as those arising from presentation types, lighting, or ethnic variations. The CDCN framework and its variants, including CDCN++ and multi-modal extensions, have established state-of-the-art results on multiple face anti-spoofing benchmarks by capitalizing on this unique operator and architectural advances (Yu et al., 2020, Yu et al., 2020).

1. Central Difference Convolution: Mathematical Foundation

Traditional 2D convolution performs a weighted aggregation of pixel intensities over spatial neighborhoods. Let $x$ denote the input feature map, $y$ the output, and $\mathcal{R}$ the sampling grid (for example, a $3\times 3$ kernel):

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)$$

In CDC, a local gradient term is integrated by aggregating the differences between each neighbor and the center pixel:

$$y_{\mathrm{CD}}(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \left[ x(p_0 + p_n) - x(p_0) \right]$$

The generalized central difference convolution mixes both the intensity aggregation and the difference aggregation, parametrized by $\theta \in [0,1]$:

$$y(p_0) = \theta \sum_{p_n \in \mathcal{R}} w(p_n) \left[ x(p_0 + p_n) - x(p_0) \right] + (1-\theta) \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)$$

When $\theta=0$, this reduces to standard convolution; when $\theta=1$, it is the pure gradient aggregation. Empirical studies have found $\theta=0.7$ to be optimal for RGB-based face anti-spoofing under domain shifts (Yu et al., 2020, Yu et al., 2020).
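The generalized operator can also be computed without materializing the difference terms, using the algebraic identity $y = \mathrm{conv}(x, w) - \theta \cdot x(p_0) \cdot \sum_{p_n} w(p_n)$. A minimal NumPy sketch of this decomposition (single channel, "valid" padding; not the released PyTorch implementation):

```python
import numpy as np

def conv2d(x, w):
    """Plain 'valid' 2-D cross-correlation of a single-channel map x with kernel w."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def cdc2d(x, w, theta=0.7):
    """Generalized central difference convolution.

    Uses the decomposition y = conv(x, w) - theta * x_center * sum(w),
    which is algebraically identical to mixing the intensity and
    central-difference aggregations with weight theta.
    """
    vanilla = conv2d(x, w)
    kh, kw = w.shape
    # centre pixel of each receptive field (assumes odd kernel size)
    centers = x[kh // 2 : x.shape[0] - kh // 2, kw // 2 : x.shape[1] - kw // 2]
    return vanilla - theta * centers * w.sum()
```

With $\theta=0$ the output matches plain convolution exactly, and with $\theta=1$ a constant input maps to zero everywhere, since every neighbor-minus-center difference vanishes.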

2. Single-Modal CDCN Architecture

The canonical CDCN backbone for RGB input is a U-Net-like, mask-regression model. The architecture comprises:

  • Input: $256 \times 256 \times 3$ RGB image.
  • Feature Extraction: Three hierarchical “cells” (low-/mid-/high-level), each built from stacked CDC blocks. Early blocks expand channel widths via a two-stage CDC (CDC_2), followed by dimensionality reduction.
  • Spatial Down-Sampling: Max pooling with stride $2$ after each cell, yielding feature map sizes $128\times128$ (low), $64\times64$ (mid), and $32\times32$ (high).
  • Multiscale Attention Fusion Module (MAFM):
    • Features from all levels are resized to $32\times32$ and concatenated.
    • Channel-wise attention is computed and applied before fusion.
  • Regression Head: A $3\times3$ convolution compresses features to a $32\times32\times1$ output mask.

The output is a $32\times32$ mask predicting ‘liveness’ across the face region, with ground-truth binary masks (face vs. background) used during training (Yu et al., 2020).
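The shape flow through this pipeline can be traced with a minimal NumPy sketch. The per-cell channel width (64 here) and nearest-neighbour resizing are illustrative assumptions, and random tensors stand in for the learned CDC blocks, attention, and regression head:

```python
import numpy as np

def pool2(x):
    """Stride-2 max pooling on an (H, W, C) feature map."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def resize_to(x, size):
    """Nearest-neighbour resize of an (H, W, C) map to (size, size, C)."""
    idx = np.arange(size) * x.shape[0] // size
    return x[idx][:, idx]

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 256, 3))           # input RGB image

feats = []
for _ in range(3):                           # low-, mid-, high-level cells
    C = 64                                   # assumed channel width
    x = rng.normal(size=x.shape[:2] + (C,))  # stand-in for stacked CDC blocks
    x = pool2(x)                             # stride-2 down-sampling
    feats.append(x)                          # 128x128, 64x64, 32x32

# MAFM stage: bring every level to 32x32 and concatenate channel-wise
fused = np.concatenate([resize_to(f, 32) for f in feats], axis=-1)
mask = fused.mean(axis=-1, keepdims=True)    # stand-in for the 3x3 regression head
```

Running this confirms the sizes quoted above: the three cells emit $128\times128$, $64\times64$, and $32\times32$ maps, the fused tensor is $32\times32\times192$, and the final mask is $32\times32\times1$.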

3. Multi-Modal CDCN

To address robustness under severe domain shifts, CDCN can be extended to multi-modal inputs, specifically RGB, depth, and infrared (IR):

  • Branch Design: Each modality is processed by an independent CDCN branch (parameters not shared).
  • Fusion Strategies:
    1. Feature-level fusion: Channel-wise concatenation of features from all modalities at the multi-scale stage, followed by a joint regression head.
    2. Input-level fusion: Concatenates the modalities as nine input channels to a single CDCN branch.
    3. Score-level fusion: Each modality’s CDCN regresses a mask independently; the final scores are averaged or weighted.

Results demonstrate that feature-level fusion is optimal on some protocols, while score averaging (especially RGB + depth) gives the best results on others. IR performance is more sensitive to ethnicity/domain shift. The overall best protocol-averaged ACER achieved is $1.02\pm0.59\%$ in the ChaLearn Face Anti-spoofing Challenge (Multi-Modal track) (Yu et al., 2020).
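The three fusion points can be contrasted in a short sketch. Here `branch` is a hypothetical placeholder mapping an image to a $32\times32$ mask (it is not the trained network), and only the wiring of the three strategies is meaningful:

```python
import numpy as np

rng = np.random.default_rng(1)
rgb, depth, ir = (rng.normal(size=(256, 256, 3)) for _ in range(3))

def branch(x):
    """Stand-in for a per-modality CDCN branch: image -> 32x32 liveness mask."""
    step = x.shape[0] // 32
    return np.abs(x[::step, ::step].mean(axis=-1))   # placeholder mask

# 1. Feature-level fusion: stack per-branch features, then one regression head.
feats = np.stack([branch(m) for m in (rgb, depth, ir)], axis=-1)   # 32x32x3
mask_feature = feats.mean(axis=-1)

# 2. Input-level fusion: concatenate modalities into a nine-channel input
#    consumed by a single branch.
nine_ch = np.concatenate([rgb, depth, ir], axis=-1)                # 256x256x9
mask_input = branch(nine_ch)

# 3. Score-level fusion: regress masks independently, then average the
#    scalar scores (RGB + depth, the pairing favored in the text).
score = np.mean([branch(rgb).mean(), branch(depth).mean()])
```

The key design difference is where information mixes: early (input-level), mid-network (feature-level), or only at the final scalar decision (score-level).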

4. Losses, Optimization, and Training

CDCNs are trained to regress a liveness mask using a composite loss:

  • Pixel-wise Mean Squared Error (MSE) over the predicted and ground-truth masks.
  • Contrastive Depth Loss (CDL), leveraging eight fixed kernels that extract high-frequency depth details, penalizing discrepancy at multiple orientations.
  • Total Loss: $\mathcal{L}_\mathrm{overall} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{CDL}}$

Optimization uses Adam with an initial learning rate of $1 \times 10^{-4}$, weight decay of $5 \times 10^{-5}$, and a batch size of $8$. Training proceeds for up to $50$ epochs, with the learning rate halved every $20$ epochs. The model is typically trained and evaluated on a single GPU (Yu et al., 2020).
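A minimal NumPy sketch of the composite loss, assuming (as is standard for contrastive depth loss) that each of the eight fixed kernels takes the center-minus-neighbor difference in one of the eight directions; the difference maps are computed with array slicing rather than explicit convolutions:

```python
import numpy as np

# the eight neighbour offsets realized by the fixed contrastive-depth kernels
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def contrast_maps(m):
    """Centre-minus-neighbour difference maps, one per direction; equivalent
    to convolving m with eight fixed 3x3 kernels (valid padding)."""
    H, W = m.shape
    inner = m[1:-1, 1:-1]
    return [inner - m[1 + di : H - 1 + di, 1 + dj : W - 1 + dj]
            for di, dj in OFFSETS]

def cdcn_loss(pred, gt):
    """Pixel-wise MSE on the liveness mask plus contrastive depth loss (CDL),
    which penalizes mismatched high-frequency detail in all eight directions."""
    mse = np.mean((pred - gt) ** 2)
    cdl = np.mean([np.mean((cp - cg) ** 2)
                   for cp, cg in zip(contrast_maps(pred), contrast_maps(gt))])
    return mse + cdl
```

Note that the CDL term is invariant to a constant offset between prediction and ground truth: a uniformly shifted mask has identical difference maps, so only the MSE term penalizes it, while the CDL term specifically targets edge and texture discrepancies.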

5. Empirical Performance and Experimental Insights

Extensive ablation studies on anti-spoofing datasets reveal:

  • θ-Sensitivity: The single-modal RGB CDCN achieves its lowest overall ACER at $\theta=0.7$ (6.02%); performance degrades for larger or smaller values.
  • Modal Comparison: Depth yields the lowest single-modal ACER (2.73%), followed by RGB (6.02%) and IR (10.10%).
  • Fusion Effects: Feature-level multi-modal fusion outperforms input- and score-level on key protocols; score-level is preferable under particular domain shifts.
  • State-of-the-art Results: The best single-modal CDCN achieves $4.84\pm1.79\%$ ACER; the best multi-modal configuration, $1.02\pm0.59\%$ ACER (Yu et al., 2020).
  • Generalization: CDCN and notably CDCN++ demonstrate marked superiority in cross-protocol, cross-dataset, and cross-attack settings, attributed to their robust encoding of local textural differences and invariance to lighting or ethnicity.

6. Significance and Impact

By introducing central difference operators into convolutional networks, CDCN enhances the extraction of fine-scale gradients and textural cues—properties crucial for distinguishing live from spoof faces, especially under presentation or hardware domain shifts. Depth cues yield strong shape discrimination, while gradients resist overfitting to color or illumination. Multi-modal CDCN architectures, via judicious fusion, exploit the complementary strengths of appearance, geometry, and spectral reflectance. This framework has set new baselines in face anti-spoofing evaluation and motivated further research into principled convolutional operator design and multimodal feature fusion (Yu et al., 2020, Yu et al., 2020). All code and trained models are released at the official repository: https://github.com/ZitongYu/CDCN.
