
Central Difference Convolutional Network (CDCN)

Updated 14 March 2026
  • CDCN is a convolutional network that integrates both intensity aggregation and local gradient differences to enhance face anti-spoofing.
  • It employs a U-Net-like design with hierarchical feature extraction and a multiscale attention fusion module to produce accurate liveness masks.
  • Multi-modal extensions with RGB, depth, and IR inputs achieve state-of-the-art results, with an optimal theta of 0.7 providing significant ACER improvements.

Central Difference Convolutional Network (CDCN) is a convolutional neural network architecture designed for face anti-spoofing tasks, distinctively characterized by the central difference convolution (CDC) operator. By explicitly combining intensity and local gradient cues, CDCN is capable of capturing fine-grained textural structures and achieving enhanced robustness to domain shifts, such as those arising from presentation types, lighting, or ethnic variations. The CDCN framework and its variants, including CDCN++ and multi-modal extensions, have established state-of-the-art results on multiple face anti-spoofing benchmarks by capitalizing on this unique operator and architectural advances (Yu et al., 2020, Yu et al., 2020).

1. Central Difference Convolution: Mathematical Foundation

Traditional 2D convolution performs a weighted aggregation of pixel intensities over spatial neighborhoods. Let $x$ denote the input feature map, $y$ the output, and $\mathcal{R}$ the sampling grid (for example, a $3\times 3$ kernel):

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)$$

In CDC, a local gradient term is integrated by aggregating the differences between each neighbor and the center pixel:

$$y_{\mathrm{CD}}(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \left[ x(p_0 + p_n) - x(p_0) \right]$$

The generalized central difference convolution mixes both the intensity aggregation and the difference aggregation, parametrized by $\theta \in [0,1]$:

$$y(p_0) = \theta \sum_{p_n \in \mathcal{R}} w(p_n) \left[ x(p_0 + p_n) - x(p_0) \right] + (1-\theta) \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)$$

When $\theta=0$, this reduces to standard convolution; when $\theta=1$, it is the pure gradient aggregation. Empirical studies have found $\theta=0.7$ to be optimal for RGB-based face anti-spoofing under domain shifts (Yu et al., 2020, Yu et al., 2020).
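The generalized operator can also be computed without materializing the difference terms, using the algebraic identity $y = \mathrm{conv}(x, w) - \theta \cdot x(p_0) \cdot \sum_{p_n} w(p_n)$. A minimal NumPy sketch of this decomposition (single channel, "valid" padding; not the released PyTorch implementation):

```python
import numpy as np

def conv2d(x, w):
    """Plain 'valid' 2-D cross-correlation of a single-channel map x with kernel w."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def cdc2d(x, w, theta=0.7):
    """Generalized central difference convolution.

    Uses the decomposition y = conv(x, w) - theta * x_center * sum(w),
    which is algebraically identical to mixing the intensity and
    central-difference aggregations with weight theta.
    """
    vanilla = conv2d(x, w)
    kh, kw = w.shape
    # centre pixel of each receptive field (assumes odd kernel size)
    centers = x[kh // 2 : x.shape[0] - kh // 2, kw // 2 : x.shape[1] - kw // 2]
    return vanilla - theta * centers * w.sum()
```

With $\theta=0$ the output matches plain convolution exactly, and with $\theta=1$ a constant input maps to zero everywhere, since every neighbor-minus-center difference vanishes.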

2. Single-Modal CDCN Architecture

The canonical CDCN backbone for RGB input is a U-Net-like, mask-regression model. The architecture comprises:

  • Input: $256 \times 256 \times 3$ RGB image.
  • Feature Extraction: Three hierarchical “cells” (low-/mid-/high-level), each built from stacked CDC blocks. Early blocks expand channel widths via a two-stage CDC (CDC_2), followed by dimensionality reduction.
  • Spatial Down-Sampling: Max pooling with stride $2$ after each cell, yielding feature map sizes $128\times128$ (low), $64\times64$ (mid), and $32\times32$ (high).
  • Multiscale Attention Fusion Module (MAFM):
    • Features from all levels are resized to $32\times32$ and concatenated.
    • Channel-wise attention is computed and applied before fusion.
  • Regression Head: A $3\times3$ convolution compresses features to a $32\times32\times1$ output mask.

The output is a $32\times32$ mask predicting ‘liveness’ across the face region, with ground-truth binary masks (face vs. background) used during training (Yu et al., 2020).
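The shape flow through this pipeline can be traced with a minimal NumPy sketch. The per-cell channel width (64 here) and nearest-neighbour resizing are illustrative assumptions, and random tensors stand in for the learned CDC blocks, attention, and regression head:

```python
import numpy as np

def pool2(x):
    """Stride-2 max pooling on an (H, W, C) feature map."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def resize_to(x, size):
    """Nearest-neighbour resize of an (H, W, C) map to (size, size, C)."""
    idx = np.arange(size) * x.shape[0] // size
    return x[idx][:, idx]

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 256, 3))           # input RGB image

feats = []
for _ in range(3):                           # low-, mid-, high-level cells
    C = 64                                   # assumed channel width
    x = rng.normal(size=x.shape[:2] + (C,))  # stand-in for stacked CDC blocks
    x = pool2(x)                             # stride-2 down-sampling
    feats.append(x)                          # 128x128, 64x64, 32x32

# MAFM stage: bring every level to 32x32 and concatenate channel-wise
fused = np.concatenate([resize_to(f, 32) for f in feats], axis=-1)
mask = fused.mean(axis=-1, keepdims=True)    # stand-in for the 3x3 regression head
```

Running this confirms the sizes quoted above: the three cells emit $128\times128$, $64\times64$, and $32\times32$ maps, the fused tensor is $32\times32\times192$, and the final mask is $32\times32\times1$.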

3. Multi-Modal CDCN

To address robustness under severe domain shifts, CDCN can be extended to multi-modal inputs, specifically RGB, depth, and infrared (IR):

  • Branch Design: Each modality is processed by an independent CDCN branch (parameters not shared).
  • Fusion Strategies:
    1. Feature-level fusion: Channel-wise concatenation of features from all modalities at the multi-scale stage, followed by a joint regression head.
    2. Input-level fusion: Concatenates the modalities as nine input channels to a single CDCN branch.
    3. Score-level fusion: Each modality’s CDCN regresses a mask independently; the final scores are averaged or weighted.

Results demonstrate that feature-level fusion is optimal on some protocols, while score averaging (especially RGB + depth) gives the best results on others. IR performance is more sensitive to ethnicity/domain shift. The overall best protocol-averaged ACER achieved is $1.02\pm0.59\%$ in the ChaLearn Face Anti-spoofing Challenge (Multi-Modal track) (Yu et al., 2020).
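The three fusion points can be contrasted in a short sketch. Here `branch` is a hypothetical placeholder mapping an image to a $32\times32$ mask (it is not the trained network), and only the wiring of the three strategies is meaningful:

```python
import numpy as np

rng = np.random.default_rng(1)
rgb, depth, ir = (rng.normal(size=(256, 256, 3)) for _ in range(3))

def branch(x):
    """Stand-in for a per-modality CDCN branch: image -> 32x32 liveness mask."""
    step = x.shape[0] // 32
    return np.abs(x[::step, ::step].mean(axis=-1))   # placeholder mask

# 1. Feature-level fusion: stack per-branch features, then one regression head.
feats = np.stack([branch(m) for m in (rgb, depth, ir)], axis=-1)   # 32x32x3
mask_feature = feats.mean(axis=-1)

# 2. Input-level fusion: concatenate modalities into a nine-channel input
#    consumed by a single branch.
nine_ch = np.concatenate([rgb, depth, ir], axis=-1)                # 256x256x9
mask_input = branch(nine_ch)

# 3. Score-level fusion: regress masks independently, then average the
#    scalar scores (RGB + depth, the pairing favored in the text).
score = np.mean([branch(rgb).mean(), branch(depth).mean()])
```

The key design difference is where information mixes: early (input-level), mid-network (feature-level), or only at the final scalar decision (score-level).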

4. Losses, Optimization, and Training

CDCNs are trained to regress a liveness mask using a composite loss:

  • Pixel-wise Mean Squared Error (MSE) over the predicted and ground-truth masks.
  • Contrastive Depth Loss (CDL), leveraging eight fixed kernels that extract high-frequency depth details, penalizing discrepancy at multiple orientations.
  • Total Loss: $\mathcal{L}_\mathrm{overall} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{CDL}}$

Optimization uses Adam with an initial learning rate of $1 \times 10^{-4}$, weight decay of $5 \times 10^{-5}$, and a batch size of $8$. Training proceeds for up to $50$ epochs, with the learning rate halved every $20$ epochs. The model is typically trained and evaluated on a single GPU (Yu et al., 2020).
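A minimal NumPy sketch of the composite loss, assuming (as is standard for contrastive depth loss) that each of the eight fixed kernels takes the center-minus-neighbor difference in one of the eight directions; the difference maps are computed with array slicing rather than explicit convolutions:

```python
import numpy as np

# the eight neighbour offsets realized by the fixed contrastive-depth kernels
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def contrast_maps(m):
    """Centre-minus-neighbour difference maps, one per direction; equivalent
    to convolving m with eight fixed 3x3 kernels (valid padding)."""
    H, W = m.shape
    inner = m[1:-1, 1:-1]
    return [inner - m[1 + di : H - 1 + di, 1 + dj : W - 1 + dj]
            for di, dj in OFFSETS]

def cdcn_loss(pred, gt):
    """Pixel-wise MSE on the liveness mask plus contrastive depth loss (CDL),
    which penalizes mismatched high-frequency detail in all eight directions."""
    mse = np.mean((pred - gt) ** 2)
    cdl = np.mean([np.mean((cp - cg) ** 2)
                   for cp, cg in zip(contrast_maps(pred), contrast_maps(gt))])
    return mse + cdl
```

Note that the CDL term is invariant to a constant offset between prediction and ground truth: a uniformly shifted mask has identical difference maps, so only the MSE term penalizes it, while the CDL term specifically targets edge and texture discrepancies.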

5. Empirical Performance and Experimental Insights

Extensive ablation studies on anti-spoofing datasets reveal:

  • θ-Sensitivity: The single-modal RGB CDCN achieves its lowest overall ACER at $\theta=0.7$ (6.02%); performance degrades for larger or smaller values.
  • Modal Comparison: Depth yields the lowest single-modal ACER (2.73%), followed by RGB (6.02%) and IR (10.10%).
  • Fusion Effects: Feature-level multi-modal fusion outperforms input- and score-level on key protocols; score-level is preferable under particular domain shifts.
  • State-of-the-art Results: The best single-modal CDCN achieves $4.84\pm1.79\%$ ACER; the best multi-modal configuration, $1.02\pm0.59\%$ ACER (Yu et al., 2020).
  • Generalization: CDCN and notably CDCN++ demonstrate marked superiority in cross-protocol, cross-dataset, and cross-attack settings, attributed to their robust encoding of local textural differences and invariance to lighting or ethnicity.

6. Significance and Impact

By introducing central difference operators into convolutional networks, CDCN enhances the extraction of fine-scale gradients and textural cues—properties crucial for distinguishing live from spoof faces, especially under presentation or hardware domain shifts. Depth cues yield strong shape discrimination, while gradients resist overfitting to color or illumination. Multi-modal CDCN architectures, via judicious fusion, exploit the complementary strengths of appearance, geometry, and spectral reflectance. This framework has set new baselines in face anti-spoofing evaluation and motivated further research into principled convolutional operator design and multimodal feature fusion (Yu et al., 2020, Yu et al., 2020). All code and trained models are released at the official repository: https://github.com/ZitongYu/CDCN.
