
DSFD: Dual Shot Face Detector

Updated 22 November 2025
  • The paper introduces DSFD, a dual-shot face detector that uses a two-stage inference procedure to improve small and multi-scale face detection.
  • It leverages key innovations—Feature Enhance Module (FEM), Progressive Anchor Loss (PAL), and Improved Anchor Matching (IAM)—to boost feature learning and anchor assignment.
  • Extensive experiments on WIDER FACE and FDDB benchmarks validate DSFD's high accuracy and efficient deployment in real-world face detection scenarios.

The Dual Shot Face Detector (DSFD) is an advanced face detection framework that achieves superior accuracy by addressing fundamental challenges in feature learning, supervision across scales, and anchor assignment coupled with data augmentation. DSFD extends the standard single-shot detection paradigm with a second, enhanced "shot" over the same backbone during training, improving detection of faces across a broad range of scales while keeping inference single-shot. Its core innovations—Feature Enhance Module (FEM), Progressive Anchor Loss (PAL), and Improved Anchor Matching (IAM)—collectively allow DSFD to outperform prior state-of-the-art detectors on the WIDER FACE and FDDB benchmarks (Li et al., 2018).

1. Architectural Structure and Dual-Shot Paradigm

DSFD extends single-stage detectors like SSD into a dual-shot pipeline using a shared convolutional backbone (e.g., VGG16 or ResNet). The architecture comprises:

  • First Shot: Utilizes original feature maps $\{of_l\}_{l=1}^{6}$ extracted from intermediate backbone layers (conv3_3, conv4_3, conv5_3, fc7, conv6_2, conv7_2). A detection head, structurally similar to SSD, operates with smaller anchors $sa_l$ to better supervise the early layers toward high-resolution, small-scale face localization.
  • Feature Enhance Module (FEM): Processes each $of_l$ to produce a spatially aligned enhanced feature map $ef_l$, capturing multi-scale context and a larger receptive field (see Section 2).
  • Second Shot: Another SSD-style head operates over $ef_l$ with anchor sizes $a_l$ (where $sa_l = 0.5\,a_l$). Only this head's outputs are used during inference, so the dual-shot strategy incurs no extra time cost per forward pass.

The overall data flow can be conceptualized as follows:

  • Input image → backbone → original features $\{of_1, \dots, of_6\}$
  • First shot head: consumes $\{of_l\}$ with small anchors $sa_l$; provides early-scale loss supervision
  • FEM: transforms $\{of_l\}$ into enhanced features $\{ef_l\}$
  • Second shot head: consumes $\{ef_l\}$ with anchors $a_l$; produces the final detections (output)
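
The flow above can be sketched at the shape level. The fixed channel count of 256 is a placeholder (real backbone channels vary by layer), and `dual_shot_flow` is an illustrative name, not a function from the DSFD codebase:

```python
STRIDES = [4, 8, 16, 32, 64, 128]   # per-level strides for a 640x640 input
INPUT = 640

def backbone_shapes(input_size=INPUT, channels=256):
    """(C, H, W) shapes of the six original feature maps of_1..of_6."""
    return [(channels, input_size // s, input_size // s) for s in STRIDES]

def dual_shot_flow(input_size=INPUT):
    """Shape-level walk through the data flow: of -> FEM -> ef."""
    of = backbone_shapes(input_size)          # first-shot features
    ef = of                                   # FEM preserves spatial size
    second = [16, 32, 64, 128, 256, 512]      # anchors a_l
    first = [a // 2 for a in second]          # sa_l = 0.5 * a_l
    return {"of": of, "ef": ef,
            "first_anchors": first, "second_anchors": second}

flow = dual_shot_flow()
print([h for _, h, _ in flow["of"]])   # [160, 80, 40, 20, 10, 5]
```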

2. Feature Enhance Module (FEM)

2.1 Motivation

Standard FPNs fuse high- and low-level features via lateral connections, while RFB blocks expand context through dilated convolutions, but neither combines both strategies. FEM merges multi-level context aggregation with enlarged receptive fields, enhancing both the semantic richness and the spatial coverage of the feature maps.

2.2 Structure and Mathematical Formulation

For each layer ll:

  1. Apply a $1 \times 1$ convolution to normalize the feature map $of_l$.
  2. Bilinearly up-sample $of_{l+1}$ to match the $H \times W$ spatial size of $of_l$, then take the element-wise product with the normalized $of_l$.
  3. Split the resulting tensor into three branches; each passes through stacked dilated convolutions at rates $e_1$, $e_2$, $e_3$.
  4. Concatenate the branch outputs along the channel axis, yielding the enhanced feature $ef_l$.

Explicitly, for spatial location $(i,j)$ at layer $l$:

$$
nc_{i,j,l} = f_\text{prod}\bigl(oc_{i,j,l},\; f_\text{up}(oc_{i,j,l+1})\bigr)
$$
$$
ec_{i,j,l} = f_\text{concat}\Bigl(f_\text{dilation}^{(1)}(nc_{i,j,l}),\; f_\text{dilation}^{(2)}(nc_{i,j,l}),\; f_\text{dilation}^{(3)}(nc_{i,j,l})\Bigr)
$$

where $f_\text{up}$ denotes up-sampling, $f_\text{prod}$ is the element-wise product, and $f_\text{dilation}^{(k)}$ are the parallel dilated convolution branches.
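
A minimal NumPy sketch of this data flow, under simplifying assumptions: kernel weights are random placeholders, a single 3x3 kernel is shared across channels, and each branch uses one dilated convolution rather than the stacked convolutions described above:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map
    (standing in for f_up; the paper uses bilinear)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def dilated_conv3x3(x, w, rate):
    """3x3 convolution with dilation `rate` and same padding on (C, H, W).
    `w` is a 3x3 kernel shared across channels (a simplification)."""
    c, h, wd = x.shape
    p = rate
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += w[di + 1, dj + 1] * xp[
                :, p + di * rate : p + di * rate + h,
                p + dj * rate : p + dj * rate + wd]
    return out

def fem(of_l, of_l1, rates=(1, 2, 3)):
    """nc = of_l * up(of_{l+1}); ec = channel-wise concat of branches."""
    rng = np.random.default_rng(0)        # placeholder branch weights
    nc = of_l * upsample2x(of_l1)         # f_prod(of_l, f_up(of_{l+1}))
    branches = [dilated_conv3x3(nc, rng.normal(size=(3, 3)), r)
                for r in rates]
    return np.concatenate(branches, axis=0)

ef = fem(np.ones((4, 8, 8)), np.ones((4, 4, 4)))
print(ef.shape)   # (12, 8, 8)
```

Spatial size is preserved throughout, so $ef_l$ stays aligned with $of_l$; only the channel count grows with the branch concatenation.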

This design generalizes the standard SSD by adding, for each layer, a semantically parallel second "shot" with an enlarged receptive field.

3. Progressive Anchor Loss (PAL)

3.1 Rationale and Implementation

PAL leverages the intuition that lower-level (earlier) features are optimal for small-object localization, while higher-level (or enhanced) features are advantageous for larger objects due to greater semantic aggregation.

  • First Shot Loss (FSL): Supervises oflof_l with small anchors salsa_l.
  • Second Shot Loss (SSL): Supervises eflef_l with standard (larger) anchors ala_l.
  • Combined Loss: Both loss terms are combined with a balancing coefficient $\lambda$ (set to $1$ in practice):

$$
\begin{align}
\mathcal{L}_{SSL} &= \frac{1}{N_{conf}}\sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^*\,L_{loc}(t_i, g_i \mid a_i) \tag{2}\\
\mathcal{L}_{FSL} &= \frac{1}{N_{conf}}\sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^*\,L_{loc}(t_i, g_i \mid sa_i) \tag{3}\\
\mathcal{L}_{PAL} &= \mathcal{L}_{FSL} + \lambda\,\mathcal{L}_{SSL} \tag{4}
\end{align}
$$

where $p_i$ and $p_i^*$ denote the predicted and ground-truth class labels, $t_i$ and $g_i$ the predicted and ground-truth box parameters, and $\beta$ balances the confidence and localization terms.

This progressive supervision enables the backbone to specialize: first-shot for finer localization of small faces, second-shot for semantic robustness in medium to large faces.
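
A toy NumPy sketch of Eqs. (2)–(4), using softmax cross-entropy for $L_{conf}$ and smooth-L1 for $L_{loc}$; anchor offset encoding is omitted, so raw offsets stand in for $t_i$ and $g_i$:

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth-L1, summed over box coordinates."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5).sum(axis=-1)

def shot_loss(conf_logits, labels, loc_pred, loc_gt, beta=1.0):
    """One shot's SSD-style loss: mean confidence cross-entropy plus a
    beta-weighted localization term over positive anchors only."""
    e = np.exp(conf_logits - conf_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    pos = labels.astype(float)          # 1 for face anchors, 0 otherwise
    loc = (pos * smooth_l1(loc_pred - loc_gt)).sum() / max(pos.sum(), 1.0)
    return ce.mean() + beta * loc

def pal_loss(first_shot, second_shot, lam=1.0):
    """L_PAL = L_FSL + lambda * L_SSL (Eq. 4); lambda = 1 in practice."""
    return shot_loss(*first_shot) + lam * shot_loss(*second_shot)

rng = np.random.default_rng(1)
batch = (rng.normal(size=(8, 2)), rng.integers(0, 2, 8),
         rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
print(pal_loss(batch, batch) > 0)   # True
```

In training, `first_shot` would hold the first-shot head's outputs matched against small anchors $sa_i$, and `second_shot` the enhanced head's outputs matched against $a_i$.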

4. Improved Anchor Matching (IAM)

4.1 Motivation and Core Procedure

Conventional anchor assignment based solely on IoU thresholds inadequately samples small faces, resulting in insufficient positive anchors. IAM synchronizes data augmentation with anchor scale assignment to maximize the ratio of positive anchor matches per face, crucial for learning robust regressors.

With probability $0.4$ for each training sample:

  • Anchor-based sampling is performed: a ground-truth face of size $S_{face}$ is chosen, a target anchor scale $S_{anchor} \in \{16, 32, 64, 128, 256, 512\}$ is uniformly sampled, and a sub-image is cropped so that the resized face matches this anchor scale. The crop is resized to $640 \times 640$. With the remaining $0.6$ probability, standard SSD augmentations (random crop/flip/color distortion) are applied instead.
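
The sampling branch can be sketched as follows; crop placement and the exact form of the standard SSD branch are elided, and `anchor_based_resize`/`augment` are illustrative names, not DSFD code:

```python
import random

ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]

def anchor_based_resize(face_size, rng):
    """Uniformly sample a target anchor scale and return the image
    resize factor that makes the face match it."""
    target = rng.choice(ANCHOR_SCALES)
    return target / face_size

def augment(face_size, rng):
    """Prob 0.4: anchor-based sampling; prob 0.6: standard SSD aug."""
    if rng.random() < 0.4:
        return ("anchor_sampling", anchor_based_resize(face_size, rng))
    return ("ssd_augmentation", 1.0)

print(augment(40, random.Random(0)))   # ('ssd_augmentation', 1.0)
```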

Anchor-to-ground-truth assignments are:

  • Positive: $\text{IoU} \ge 0.4$
  • Negative: $\text{IoU} \le 0.3$

This approach increases the mean number of matched anchors per face (from $\approx 6.4$ to $\approx 6.9$), reduces anchor/face scale mismatch, and accelerates the convergence and stability of bounding-box regression.
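
A minimal NumPy sketch of the assignment rule with the thresholds above (anchors between the two thresholds are ignored during training):

```python
import numpy as np

def iou(anchors, gts):
    """Pairwise IoU between (N, 4) anchors and (M, 4) boxes, x1y1x2y2."""
    x1 = np.maximum(anchors[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def assign(anchors, gts, pos_thr=0.4, neg_thr=0.3):
    """Return +1 (positive), -1 (negative), or 0 (ignored) per anchor."""
    best = iou(anchors, gts).max(axis=1)
    return np.where(best >= pos_thr, 1, np.where(best <= neg_thr, -1, 0))

anchors = np.array([[0, 0, 16, 16], [0, 0, 64, 64], [100, 100, 132, 132]], float)
faces = np.array([[0, 0, 20, 20]], float)
print(assign(anchors, faces))   # [ 1 -1 -1]
```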

5. Training Protocol and Implementation

5.1 Anchor, Augmentation, and Hyperparameter Specification

Anchors are assigned across six feature map levels ($l = 1$ to $6$), with corresponding strides of $\{4, 8, 16, 32, 64, 128\}$ for an input size of $640 \times 640$ pixels, producing feature maps of sizes $\{160^2, 80^2, 40^2, 20^2, 10^2, 5^2\}$. Anchor scales:

  • Second-shot ($a_l$): $\{16, 32, 64, 128, 256, 512\}$ pixels
  • First-shot ($sa_l$): $0.5\,a_l$

All anchors use a $1.5:1$ aspect ratio based on face dataset statistics.
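
Anchor generation under this specification can be sketched as follows; whether the $1.5{:}1$ (h:w) ratio preserves anchor area, as assumed here, or simply sets the height to $1.5\times$ the scale is not specified above:

```python
import math

STRIDES = [4, 8, 16, 32, 64, 128]
SCALES = [16, 32, 64, 128, 256, 512]     # second-shot scales a_l

def level_anchors(level, input_size=640, first_shot=False):
    """Center-form anchors (cx, cy, w, h) for one feature level;
    the first shot halves the scale (sa_l = 0.5 * a_l)."""
    stride, scale = STRIDES[level], SCALES[level]
    if first_shot:
        scale /= 2
    w = scale / math.sqrt(1.5)           # area-preserving 1.5:1 ratio
    h = 1.5 * w
    n = input_size // stride
    return [((i + 0.5) * stride, (j + 0.5) * stride, w, h)
            for j in range(n) for i in range(n)]

print(len(level_anchors(0)))   # 25600 anchors at stride 4 (160 x 160)
```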

Data augmentation applies a blend of photometric distortions, random horizontal flip, anchor-based sampling, and standard SSD random crops.

Optimization uses SGD with momentum $0.9$, weight decay $5 \times 10^{-4}$, and a learning rate schedule of $1 \times 10^{-3}$ (first $40k$ steps), $1 \times 10^{-4}$ (next $10k$), and $1 \times 10^{-5}$ (final $10k$). Batch size is $16$ over $4$ GPUs ($\approx 60k$ total steps). Backbones are ImageNet-pretrained (VGG16, ResNet-50/101/152), with new layers initialized by Xavier.
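
The piecewise learning-rate schedule, expressed as a small helper:

```python
def learning_rate(step):
    """1e-3 for the first 40k steps, 1e-4 for the next 10k, 1e-5 after."""
    if step < 40_000:
        return 1e-3
    if step < 50_000:
        return 1e-4
    return 1e-5

print(learning_rate(0), learning_rate(45_000), learning_rate(55_000))
# 0.001 0.0001 1e-05
```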

6. Experimental Evaluation

6.1 WIDER FACE and FDDB Metrics

DSFD outperforms contemporaneous detectors (e.g., PyramidBox, SRN, SSH, S³FD), exhibiting the following performance:

  • WIDER FACE val (AP, %): Easy 96.6, Medium 95.7, Hard 90.4
  • WIDER FACE test (AP, %): Easy 96.0, Medium 95.3, Hard 90.0

On the FDDB benchmark (recall at $1000$ false positives):

  • Discrete ROC: $99.1\%$
  • Continuous ROC: $86.2\%$

6.2 Ablation Analysis

Ablation studies validate each module's contribution:

  • FEM (on VGG16-FSSD baseline): +0.4% (easy), +1.2% (medium), +5.5% (hard)
  • PAL (on ResNet-50+FEM): +0.3% (easy), +0.3% (medium), +0.6% (hard)
  • IAM (on ResNet-101+FEM): +0.3% (easy), +0.1% (medium), +0.3% (hard)

6.3 Inference Speed

DSFD achieves $\approx 22$ FPS (ResNet-50, VGA resolution, NVIDIA P40), with negligible runtime overhead compared to a single-shot SSD, as only the second-shot head is used at inference.

7. Practical Considerations and Deployment

7.1 Reproducibility and Code Access

Canonical code and pretrained models for DSFD are located at https://github.com/TencentYoutuResearch/FaceDetection-DSFD. Training is conducted solely on the WIDER FACE dataset following the specified schedule and augmentations.

7.2 Backbone Selection

Deeper backbones (SE-ResNeXt101, DPN-98) offer incremental AP improvements but increase inference cost; DSFD remains backbone-agnostic and supports trade-offs for real-time scenarios (e.g., ResNet-50 or MobileNet).

7.3 Deployment Guidance

For inference, the first-shot head is omitted. Image pyramids ($0.5\times$–$2.0\times$ input scale) are advised in cases of extreme face-scale variability. Quantization or TensorRT conversion can further accelerate deployment.
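
A hedged sketch of image-pyramid inference: `detect` is a placeholder for the second-shot detector, and the cross-scale NMS that would normally follow is omitted:

```python
def pyramid_detect(detect, scales=(0.5, 1.0, 2.0)):
    """Run `detect` at each pyramid scale and map the resulting
    (x1, y1, x2, y2, score) boxes back to original-image coordinates."""
    all_boxes = []
    for s in scales:
        for (x1, y1, x2, y2, score) in detect(s):
            all_boxes.append((x1 / s, y1 / s, x2 / s, y2 / s, score))
    return all_boxes   # in practice, followed by NMS across scales

# Toy detector that returns one fixed box regardless of scale.
boxes = pyramid_detect(lambda s: [(10, 10, 50, 50, 0.9)])
print(len(boxes))   # 3
```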


DSFD’s architectural, supervisory, and augmentation innovations enable accurate detection over a wide range of face scales, rendering it an effective baseline for face detection research and application (Li et al., 2018).

References

  • Li, J., Wang, Y., Wang, C., Tai, Y., Qian, J., Yang, J., Wang, C., Li, J., Huang, F. (2018). DSFD: Dual Shot Face Detector. arXiv:1810.10220.