
Central Difference Convolutional Networks

Updated 10 April 2026
  • Central Difference Convolutional Networks (CDCN) are deep models that combine learnable intensity and central-difference gradient information to detect fine-grained spoofing cues.
  • The CDC operation blends standard convolution with a central difference term governed by a hyperparameter θ, balancing gradient and intensity features without extra parameters.
  • Variants like CDCN++ and DC-CDN improve efficiency and generalization, achieving state-of-the-art results in face anti-spoofing benchmarks with minimal parameter overhead.

Central Difference Convolutional Networks (CDCN) are a class of deep architectures that generalize standard convolutional networks by integrating a learnable mixture of local intensity and central-difference gradient information at each spatial location. Initially developed for frame-level face anti-spoofing—specifically, liveness detection—they address deficiencies of stacked vanilla convolutional backbones, particularly in modeling fine-grained spoofing cues and robustness to domain variation. CDCN and its successors (e.g., CDCN++, DC-CDN) have demonstrated substantial improvements across intra-dataset, cross-dataset, cross-type, and multi-modal face anti-spoofing benchmarks, establishing new state-of-the-art accuracy with minimal parameter overhead and favorable trade-offs in efficiency and representation (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021).

1. Mathematical Definition and Operator Design

The central operation underlying CDCN is the Central Difference Convolution (CDC). Let $x: \mathbb{Z}^2 \to \mathbb{R}$ be an input feature map and $w: \mathcal{R} \to \mathbb{R}$ a $k \times k$ kernel with receptive field $\mathcal{R}$ about spatial center $p_0$. The output at $p_0$ is:

$$y(p_0) = \theta \sum_{p_n \in \mathcal{R}} w(p_n)\,[x(p_0 + p_n) - x(p_0)] + (1-\theta) \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)$$

where $\theta \in [0,1]$. The $\theta$ parameter balances gradient-oriented (central difference) aggregation and conventional intensity aggregation. $\theta = 0$ recovers standard convolution; $\theta = 1$ yields gradient-only aggregation. Optimal values are dataset- and task-dependent.
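To make the two-term definition concrete, here is a minimal pure-Python sketch that evaluates it directly on a single-channel map. The function name `cdc_naive`, the list-of-lists layout, and the zero padding at the borders are assumptions of this sketch, not details taken from the cited papers.

```python
def cdc_naive(x, w, theta):
    """Central difference convolution on a single-channel 2D map.

    x: H x W list of lists; w: k x k kernel with odd k.
    Zero padding is an assumption here so the output keeps the input size.
    """
    h, wd = len(x), len(x[0])
    r = len(w) // 2
    out = [[0.0] * wd for _ in range(h)]
    for i in range(h):
        for j in range(wd):
            grad_term = 0.0   # sum of w(p_n) * (x(p_0 + p_n) - x(p_0))
            inten_term = 0.0  # sum of w(p_n) * x(p_0 + p_n)
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    inside = 0 <= ii < h and 0 <= jj < wd
                    neighbor = x[ii][jj] if inside else 0.0
                    weight = w[di + r][dj + r]
                    grad_term += weight * (neighbor - x[i][j])
                    inten_term += weight * neighbor
            out[i][j] = theta * grad_term + (1.0 - theta) * inten_term
    return out


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
w = [[1.0] * 3 for _ in range(3)]
vanilla_center = cdc_naive(x, w, 0.0)[1][1]   # plain convolution response
gradient_center = cdc_naive(x, w, 1.0)[1][1]  # gradient-only response
```

On this toy input the center response is 45.0 for θ = 0 (the plain sum of the window) and 0.0 for θ = 1, because the differences around the center value 5 cancel under a uniform kernel.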

An efficient implementation expresses the CDC as the sum of a vanilla convolution and a per-pixel scaled input term:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n) - \theta\, x(p_0) \sum_{p_n \in \mathcal{R}} w(p_n)$$

No extra learnable parameters are introduced (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021).
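The factored form (a vanilla convolution minus θ · x(p_0) · Σ w(p_n)) can be checked numerically against the two-term definition. The sketch below is a pure-Python illustration under assumed names (`conv2d_same`, `cdc_efficient`, `cdc_reference`) and assumed zero padding. In a deep learning framework the second term would typically be computed as a 1×1 convolution with the spatially summed kernel, but that is not shown here.

```python
def conv2d_same(x, w):
    """Vanilla zero-padded convolution (cross-correlation) on a 2D map."""
    h, wd, r = len(x), len(x[0]), len(w) // 2
    out = [[0.0] * wd for _ in range(h)]
    for i in range(h):
        for j in range(wd):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < wd:
                        acc += w[di + r][dj + r] * x[ii][jj]
            out[i][j] = acc
    return out


def cdc_efficient(x, w, theta):
    """CDC as vanilla convolution minus theta * x(p_0) * sum of kernel weights."""
    kernel_sum = sum(sum(row) for row in w)
    vanilla = conv2d_same(x, w)
    return [
        [vanilla[i][j] - theta * x[i][j] * kernel_sum for j in range(len(x[0]))]
        for i in range(len(x))
    ]


def cdc_reference(x, w, theta):
    """Direct evaluation of the two-term CDC definition, for comparison."""
    h, wd, r = len(x), len(x[0]), len(w) // 2
    out = [[0.0] * wd for _ in range(h)]
    for i in range(h):
        for j in range(wd):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    nb = x[ii][jj] if 0 <= ii < h and 0 <= jj < wd else 0.0
                    wt = w[di + r][dj + r]
                    out[i][j] += theta * wt * (nb - x[i][j]) + (1 - theta) * wt * nb
    return out


x = [[0.5, 1.0, 2.0], [3.0, 1.5, 0.0], [2.5, 4.0, 1.0]]
w = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4], [0.2, 0.1, -0.1]]
fast = cdc_efficient(x, w, 0.7)
slow = cdc_reference(x, w, 0.7)
max_gap = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
```

The two results agree up to floating-point rounding, and the only extra work beyond the vanilla convolution is one kernel sum and one multiply per pixel, which is why the operator adds no learnable parameters.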

2. Architectural Instantiations: CDCN, CDCN++, and Extensions

The original CDCN replaces the convolutions in the DepthNet-style FAS backbone (see Liu et al. 2018) with CDC layers, all sharing the same $\theta$ setting. The canonical CDCN pipeline is:

  • Input: RGB face frame
  • Stem: CDC layer followed by BatchNorm/ReLU
  • Low/Mid/High Stages: Each comprises stacked CDC blocks with increasing channels, followed by max-pooling; each stage produces its own output.
  • Feature Fusion: Channel-wise concatenation of Low, Mid, High features into a single tensor.
  • Head: Three CDC layers output a depth map.
  • Loss: supervision on the predicted depth map

CDCN++ employs neural architecture search (NAS) over a specialized CDC-based search space. The search space allows variable kernel sizes, expansion ratios, and “CDC_2@r” operations (two stacked CDC layers with channel expansion and projection). A DARTS/PC-DARTS-like gradient-based bi-level optimization is adopted. The derived backbone exhibits cell-wise heterogeneity and tailored connectivity, and achieves superior accuracy (e.g., 0.2% ACER on OULU-NPU Protocol 1) with reduced parameter count (Yu et al., 2020, Yu et al., 2020).

DC-CDN, introduced by Yu et al. (2021), further decouples CDC into two sparse "cross" variants (C-CDC): Horizontal-Vertical (HV) and Diagonal (DG), forming dual parallel backbones. Cross Feature Interaction Modules (CFIM) adaptively fuse features from the HV and DG streams at each scale. This yields favorable efficiency–accuracy trade-offs: DC-CDN reduces both parameter count and FLOPs relative to standard CDCN while often surpassing it in accuracy.

3. Multiscale Attention and Multi-Modal Fusion

Robust face anti-spoofing requires explicit multi-scale and possibly multi-modal feature integration. CDCN++ introduces the Multiscale Attention Fusion Module (MAFM). For each scale (low, mid, high), spatial attention masks are produced by:

  1. Channel-wise average and max pooling, concatenation
  2. Depthwise convolution with large kernels (one kernel size per low/mid/high scale)
  3. Sigmoid activation to generate spatial masks
  4. Elementwise masking of the feature maps

Refined features are then concatenated and passed through the CDCN++ head (Yu et al., 2020).
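The four attention steps can be sketched for one scale as follows. This is a pure-Python illustration, not the CDCN++ implementation: the function name `spatial_attention`, the C x H x W list layout, the zero padding, and the choice to merge the two pooled maps with a single 2-to-1 convolution are all assumptions of this sketch.

```python
import math


def spatial_attention(features, kernel):
    """Refine a C x H x W feature stack with a spatial attention mask.

    kernel: 2 x k x k weights applied to the [average, max] pooled maps.
    """
    c, h, w = len(features), len(features[0]), len(features[0][0])
    # Step 1: channel-wise average and max pooling, kept as two maps.
    avg_map = [[sum(features[ch][i][j] for ch in range(c)) / c
                for j in range(w)] for i in range(h)]
    max_map = [[max(features[ch][i][j] for ch in range(c))
                for j in range(w)] for i in range(h)]
    pooled = [avg_map, max_map]
    r = len(kernel[0]) // 2
    mask = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            # Step 2: zero-padded convolution over the two pooled maps.
            for m in range(2):
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[m][di + r][dj + r] * pooled[m][ii][jj]
            # Step 3: sigmoid squashes the response into (0, 1).
            mask[i][j] = 1.0 / (1.0 + math.exp(-acc))
    # Step 4: element-wise masking, with the mask shared across channels.
    refined = [[[features[ch][i][j] * mask[i][j] for j in range(w)]
                for i in range(h)] for ch in range(c)]
    return refined, mask


features = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
zero_kernel = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
refined, mask = spatial_attention(features, zero_kernel)
```

With an all-zero kernel every mask value is 0.5, so the refined features are exactly half the inputs; a trained kernel would instead emphasize some spatial locations over others.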

For multi-modal FAS (e.g., RGB+depth+IR), Yu et al. (2020) employ one CDCN branch per modality and concatenate the branches' multi-scale outputs channel-wise before head fusion. Single-modal, feature-level, input-level, and score-level fusion strategies are compared, with feature- and score-level fusion providing the best generalization.
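The difference between feature-level and score-level fusion comes down to where the modalities are combined. The sketch below illustrates both with toy data; the helper names and the weighted-average rule for scores are assumptions of this sketch rather than the exact procedure of the cited work.

```python
def concat_channels(stacks):
    """Channel-wise concatenation of several C_i x H x W stacks."""
    return [channel for stack in stacks for channel in stack]


def score_level_fusion(scores, weights=None):
    """Combine per-modality liveness scores with a weighted average."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(s * w for s, w in zip(scores, weights))


# Feature-level fusion: each branch yields a small feature stack
# (2 channels of 1 x 1 here), and the head sees all channels at once.
rgb_feats = [[[0.1]], [[0.2]]]
depth_feats = [[[0.3]], [[0.4]]]
ir_feats = [[[0.5]], [[0.6]]]
fused_feats = concat_channels([rgb_feats, depth_feats, ir_feats])

# Score-level fusion: each branch runs end to end, and only the
# scalar liveness scores are merged.
fused_score = score_level_fusion([0.9, 0.6, 0.3])
```

Input-level fusion would apply the same channel concatenation to the raw modality images before a single shared backbone, rather than to the branch outputs.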

4. Cross Central Difference Convolution (C-CDC) and Dual-Cross Networks

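Full CDC aggregates gradients over all eight neighborhood directions, but Yu et al. (2021) observe redundancy and potential optimization inefficiency. By restricting CDC to a "cross" neighborhood $\mathcal{S}$ (horizontal-vertical $\mathcal{S}_{HV}$ or diagonal $\mathcal{S}_{DG}$, each of size 5 vs. 9 for the full $3 \times 3$ neighborhood $\mathcal{R}$), Cross Central Difference Convolution (C-CDC) is defined:

$$y(p_0) = \theta \sum_{p_n \in \mathcal{S}} w(p_n)\,[x(p_0 + p_n) - x(p_0)] + (1-\theta) \sum_{p_n \in \mathcal{S}} w(p_n)\, x(p_0 + p_n)$$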

Empirically, C-CDC with either cross pattern improves ACER over full CDC while using fewer parameters and a third of the FLOPs (Yu et al., 2021, Table 1).
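A minimal sketch of C-CDC restricts the CDC sum to five offsets: the center plus either the four horizontal-vertical neighbors or the four diagonal neighbors. The offset lists, the function name `cross_cdc`, and the zero padding are assumptions made for illustration.

```python
# Five-point neighborhoods as (row, column) offsets, center included.
HV_OFFSETS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]
DG_OFFSETS = [(-1, -1), (-1, 1), (0, 0), (1, -1), (1, 1)]


def cross_cdc(x, weights, offsets, theta):
    """C-CDC over a sparse neighborhood; weights[n] pairs with offsets[n]."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for wt, (di, dj) in zip(weights, offsets):
                ii, jj = i + di, j + dj
                nb = x[ii][jj] if 0 <= ii < h and 0 <= jj < w else 0.0
                acc += theta * wt * (nb - x[i][j]) + (1.0 - theta) * wt * nb
            out[i][j] = acc
    return out


x = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
ones = [1.0] * 5
hv_center = cross_cdc(x, ones, HV_OFFSETS, 1.0)[1][1]
dg_center = cross_cdc(x, ones, DG_OFFSETS, 1.0)[1][1]
```

On this input the gradient-only response at the center is -8.0 for the HV pattern and -7.0 for the DG pattern, because only the diagonal pattern sees the nonzero top-left pixel. Each pattern carries 5 weights per kernel instead of 9.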

The Dual-Cross Central Difference Network (DC-CDN) fuses parallel C-CDC(HV) and C-CDC(DG) streams with CFIM, which mixes the two streams' features using a sigmoid and learnable, per-scale scalars. This configuration achieves state-of-the-art accuracy and efficiency across multiple FAS benchmarks (Yu et al., 2021).
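One plausible reading of this fusion is a sigmoid-gated convex combination of the two streams, with one scalar per stream and scale. The sketch below illustrates that reading only; the exact CFIM formula is not reproduced here, and the names `cfim`, `alpha`, and `beta` are hypothetical.

```python
import math


def cfim(feat_hv, feat_dg, alpha, beta):
    """Sigmoid-gated mixing of two same-shaped H x W feature maps.

    alpha and beta stand in for the learnable per-scale scalars.
    """
    g_hv = 1.0 / (1.0 + math.exp(-alpha))  # share of HV kept in the HV stream
    g_dg = 1.0 / (1.0 + math.exp(-beta))   # share of DG kept in the DG stream
    out_hv = [[g_hv * a + (1.0 - g_hv) * b for a, b in zip(ra, rb)]
              for ra, rb in zip(feat_hv, feat_dg)]
    out_dg = [[g_dg * b + (1.0 - g_dg) * a for a, b in zip(ra, rb)]
              for ra, rb in zip(feat_hv, feat_dg)]
    return out_hv, out_dg


feat_hv = [[2.0, 4.0]]
feat_dg = [[0.0, 8.0]]
mixed_hv, mixed_dg = cfim(feat_hv, feat_dg, 0.0, 0.0)
```

With both scalars at zero each gate equals 0.5, so both outputs become the plain average `[[1.0, 6.0]]`. Large positive scalars leave each stream nearly unchanged, which lets training decide how much cross-stream exchange is useful at each scale.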

5. Training Protocols, Ablations, and Results

Datasets and Protocols:

  • OULU-NPU (intra-dataset, four protocols; APCER, BPCER, ACER)
  • SiW and SiW-M (cross-type, leave-one-out; ACER, EER)
  • CASIA-MFSD, Replay-Attack, MSU-MFSD (cross-dataset/type; HTER, AUC)
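The reported error rates are simple functions of thresholded scores: ACER is the mean of APCER and BPCER, and HTER is the mean of the false acceptance and false rejection rates. The sketch below assumes that a score at or above the threshold means "live", that APCER takes the worst case over attack types, and that HTER pools all attacks; the function name and data layout are hypothetical.

```python
def fas_metrics(scores, kinds, threshold):
    """kinds[i] is "live" or an attack-type name; score >= threshold means live."""
    live = [s for s, k in zip(scores, kinds) if k == "live"]
    # BPCER: fraction of live (bona fide) samples rejected as attacks.
    bpcer = sum(s < threshold for s in live) / len(live)
    accepted = {}
    for s, k in zip(scores, kinds):
        if k != "live":
            accepted.setdefault(k, []).append(s >= threshold)
    # APCER: attack acceptance rate of the worst-case attack type.
    apcer = max(sum(v) / len(v) for v in accepted.values())
    # FAR pools every attack sample regardless of type.
    pooled = [a for v in accepted.values() for a in v]
    far = sum(pooled) / len(pooled)
    return {
        "APCER": apcer,
        "BPCER": bpcer,
        "ACER": (apcer + bpcer) / 2,
        "HTER": (far + bpcer) / 2,
    }


scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.3]
kinds = ["live", "live", "live", "print", "print", "replay", "replay"]
metrics = fas_metrics(scores, kinds, 0.5)
```

Here one of three live samples is rejected (BPCER = 1/3), the print attacks are accepted half the time (APCER = 0.5), and one of four attacks overall is accepted (FAR = 0.25). In cross-dataset testing the threshold is usually fixed on a separate development set before HTER is measured.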

Key Ablations:

  • A $\theta$ sweep reveals a clear global optimum in ACER on OULU-NPU Protocol 1 (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021)
  • Against vanilla, LBC, and Gabor convs, CDC achieves lowest error on FAS.
  • MAFM and CFIM attention modules further improve generalization.
  • Patch Exchange (PE) augmentation in (Yu et al., 2021), involving random region swaps between images with label consistency, improves cross-domain and intra-domain metrics.
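The Patch Exchange idea from the last bullet can be sketched as an in-place swap of one region between two samples, applied to the images and to their label maps so the two stay consistent. This sketch assumes pixel-wise label maps at image resolution, a single square patch, and single-channel data; the function name and signature are hypothetical.

```python
import random


def patch_exchange(img_a, lab_a, img_b, lab_b, patch, rng=random):
    """Swap one random patch x patch region between two samples, in place.

    The same region is swapped in the label maps, so every pixel keeps
    the label of the sample it came from.
    """
    h, w = len(img_a), len(img_a[0])
    top = rng.randint(0, h - patch)   # randint is inclusive at both ends
    left = rng.randint(0, w - patch)
    for i in range(top, top + patch):
        for j in range(left, left + patch):
            img_a[i][j], img_b[i][j] = img_b[i][j], img_a[i][j]
            lab_a[i][j], lab_b[i][j] = lab_b[i][j], lab_a[i][j]
    return top, left


size = 4
img_a = [[0.0] * size for _ in range(size)]  # stand-in for a spoof frame
lab_a = [[0.0] * size for _ in range(size)]  # its all-zero label map
img_b = [[1.0] * size for _ in range(size)]  # stand-in for a live frame
lab_b = [[1.0] * size for _ in range(size)]  # its label map
top, left = patch_exchange(img_a, lab_a, img_b, lab_b, 2, random.Random(0))
```

After the call, `img_a` contains a 2 x 2 block copied from `img_b`, and `lab_a` marks exactly those pixels with the other sample's label values.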

Best performance examples:

| Method | Dataset/Protocol | ACER/EER/HTER |
|---|---|---|
| CDCN++ | OULU-NPU, Protocol-1 | 0.2% ACER |
| CDCN++ | SiW, Protocols 1–3 | (0.12%, 0.04%, 1.90%) ACER |
| CDCN++ | CASIA→Replay (cross) | 6.5% HTER |
| CDCN++ | SiW-M (13-way, cross-type) | 12.7% ACER, 11.9% EER |
| DC-CDN | OULU-NPU, Protocol-1 (+PE) | 0.4% ACER |
| DC-CDN | CASIA→Replay (cross) | 6.0% HTER |
| CDCN (multi-modal) | CASIA-SURF CeFA (MModal) | 1.02% ACER |

Efficiency: CDCN, C-CDN, and DC-CDN are compared by parameter count (in millions) and FLOPs.

6. Significance, Generalization, and Visualization

CDCN and its derivatives provide strong domain robustness, evident from consistent validation–test metrics under domain shift (e.g., illumination, print/replay/3D-mask attacks, cross-ethnicity scenarios). t-SNE visualizations demonstrate that CDC-based feature representations more clearly separate live from spoof samples than vanilla CNNs.

The multi-modal extension shows that depth consistently outperforms RGB and IR under most protocols; fusion models further raise the ceiling. Feature map analyses reveal that CDC layers and MAFM/CFIM modules enhance attention toward spoof-relevant artifacts. A plausible implication is that CDC and its efficient variants capture local micro-patterns critical to detecting presentation attacks, which standard convolution and hand-crafted feature analogs insufficiently address.

Central Difference Convolution has inspired a suite of efficient, parameter-sensitive, and robust backbone designs for fine-grained pattern analysis. Recent work including C-CDC and dual-stream fusion advances push CDC's trade-off frontier further, making it a building block not just for face anti-spoofing, but for broader pixel-wise classification and detection tasks sensitive to microstructure and domain variability (Yu et al., 2020, Yu et al., 2020, Yu et al., 2021). Refinements such as plug-and-play augmentations (PE), and cross-modal/multi-scale attention, provide general templates for extending CDCN to new datasets and modalities.
