
DSFD: Dual Shot Face Detector

Updated 22 November 2025
  • The paper introduces DSFD, a dual-shot face detector that uses a two-stage inference procedure to improve small and multi-scale face detection.
  • It leverages key innovations—Feature Enhance Module (FEM), Progressive Anchor Loss (PAL), and Improved Anchor Matching (IAM)—to boost feature learning and anchor assignment.
  • Extensive experiments on WIDER FACE and FDDB benchmarks validate DSFD's high accuracy and efficient deployment in real-world face detection scenarios.

The Dual Shot Face Detector (DSFD) is an advanced face detection framework that achieves superior accuracy by addressing fundamental challenges in feature learning, supervision across scales, and anchor assignment coupled with data augmentation. DSFD extends the standard single-shot detection paradigm with a second, enhanced "shot" over the same backbone during training, improving detection of faces across a broad range of scales while keeping inference single-shot. Its core innovations—Feature Enhance Module (FEM), Progressive Anchor Loss (PAL), and Improved Anchor Matching (IAM)—collectively allow DSFD to outperform prior state-of-the-art detectors on the WIDER FACE and FDDB benchmarks (Li et al., 2018).

1. Architectural Structure and Dual-Shot Paradigm

DSFD extends single-stage detectors like SSD into a dual-shot pipeline using a shared convolutional backbone (e.g., VGG16 or ResNet). The architecture comprises:

  • First Shot: Utilizes original feature maps $\{of_l\}_{l=1}^{6}$ extracted from intermediate backbone layers (conv3_3, conv4_3, conv5_3, fc7, conv6_2, conv7_2). A detection head, structurally similar to SSD, operates with smaller anchors $sa_l$ to better supervise the early layers toward high-resolution, small-scale face localization.
  • Feature Enhance Module (FEM): Processes each $of_l$ to produce a spatially aligned enhanced feature map $ef_l$, capturing multi-scale context and a larger receptive field (see Section 2).
  • Second Shot: Another SSD-style head operates over $ef_l$ with anchor sizes $a_l$ (where $sa_l = 0.5\,a_l$). Only this head's outputs are used during inference, so the dual-shot strategy incurs no extra time cost per forward pass.

The overall data flow can be conceptualized as follows:

  • Input image → backbone → original features $\{of_1, \dots, of_6\}$
  • First shot head: consumes $\{of_l\}$ with small anchors $sa_l$; provides early-scale loss supervision
  • FEM: transforms $\{of_l\}$ into enhanced features $\{ef_l\}$
  • Second shot head: consumes $\{ef_l\}$ with anchors $a_l$; produces the final detections (output)
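
The flow above can be sketched at the shape level. The fixed channel count of 256 is a placeholder (real backbone channels vary by layer), and `dual_shot_flow` is an illustrative name, not a function from the DSFD codebase:

```python
STRIDES = [4, 8, 16, 32, 64, 128]   # per-level strides for a 640x640 input
INPUT = 640

def backbone_shapes(input_size=INPUT, channels=256):
    """(C, H, W) shapes of the six original feature maps of_1..of_6."""
    return [(channels, input_size // s, input_size // s) for s in STRIDES]

def dual_shot_flow(input_size=INPUT):
    """Shape-level walk through the data flow: of -> FEM -> ef."""
    of = backbone_shapes(input_size)          # first-shot features
    ef = of                                   # FEM preserves spatial size
    second = [16, 32, 64, 128, 256, 512]      # anchors a_l
    first = [a // 2 for a in second]          # sa_l = 0.5 * a_l
    return {"of": of, "ef": ef,
            "first_anchors": first, "second_anchors": second}

flow = dual_shot_flow()
print([h for _, h, _ in flow["of"]])   # [160, 80, 40, 20, 10, 5]
```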

2. Feature Enhance Module (FEM)

2.1 Motivation

Standard FPNs fuse high- and low-level features via lateral connections, while RFB blocks expand context through dilated convolutions, but neither combines both strategies. FEM merges multi-level context aggregation with enlarged receptive fields, enhancing both the semantic richness and the spatial coverage of the feature maps.

2.2 Structure and Mathematical Formulation

For each layer ll:

  1. Apply a $1 \times 1$ convolution to normalize the feature map $of_l$.
  2. Bilinearly up-sample $of_{l+1}$ to match the $H \times W$ spatial size of $of_l$, then take the element-wise product with the normalized $of_l$.
  3. Split the resulting tensor into three branches; each passes through stacked dilated convolutions at rates $e_1$, $e_2$, $e_3$.
  4. Concatenate the branch outputs along the channel axis, yielding the enhanced feature $ef_l$.

Explicitly, for spatial location $(i,j)$ at layer $l$:

$$
nc_{i,j,l} = f_\text{prod}\bigl(oc_{i,j,l},\; f_\text{up}(oc_{i,j,l+1})\bigr)
$$
$$
ec_{i,j,l} = f_\text{concat}\Bigl(f_\text{dilation}^{(1)}(nc_{i,j,l}),\; f_\text{dilation}^{(2)}(nc_{i,j,l}),\; f_\text{dilation}^{(3)}(nc_{i,j,l})\Bigr)
$$

where $f_\text{up}$ denotes up-sampling, $f_\text{prod}$ is the element-wise product, and $f_\text{dilation}^{(k)}$ are the parallel dilated convolution branches.
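
A minimal NumPy sketch of this data flow, under simplifying assumptions: kernel weights are random placeholders, a single 3x3 kernel is shared across channels, and each branch uses one dilated convolution rather than the stacked convolutions described above:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map
    (standing in for f_up; the paper uses bilinear)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def dilated_conv3x3(x, w, rate):
    """3x3 convolution with dilation `rate` and same padding on (C, H, W).
    `w` is a 3x3 kernel shared across channels (a simplification)."""
    c, h, wd = x.shape
    p = rate
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += w[di + 1, dj + 1] * xp[
                :, p + di * rate : p + di * rate + h,
                p + dj * rate : p + dj * rate + wd]
    return out

def fem(of_l, of_l1, rates=(1, 2, 3)):
    """nc = of_l * up(of_{l+1}); ec = channel-wise concat of branches."""
    rng = np.random.default_rng(0)        # placeholder branch weights
    nc = of_l * upsample2x(of_l1)         # f_prod(of_l, f_up(of_{l+1}))
    branches = [dilated_conv3x3(nc, rng.normal(size=(3, 3)), r)
                for r in rates]
    return np.concatenate(branches, axis=0)

ef = fem(np.ones((4, 8, 8)), np.ones((4, 4, 4)))
print(ef.shape)   # (12, 8, 8)
```

Spatial size is preserved throughout, so $ef_l$ stays aligned with $of_l$; only the channel count grows with the branch concatenation.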

This design generalizes the standard SSD by adding, for each layer, a semantically parallel second "shot" with an enlarged receptive field.

3. Progressive Anchor Loss (PAL)

3.1 Rationale and Implementation

PAL leverages the intuition that lower-level (earlier) features are optimal for small-object localization, while higher-level (or enhanced) features are advantageous for larger objects due to greater semantic aggregation.

  • First Shot Loss (FSL): Supervises oflof_l with small anchors salsa_l.
  • Second Shot Loss (SSL): Supervises eflef_l with standard (larger) anchors ala_l.
  • Combined Loss: Both loss terms are combined with a balancing coefficient $\lambda$ (set to $1$ in practice):

$$
\begin{align}
\mathcal{L}_{SSL} &= \frac{1}{N_{conf}}\sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^*\,L_{loc}(t_i, g_i \mid a_i) \tag{2}\\
\mathcal{L}_{FSL} &= \frac{1}{N_{conf}}\sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^*\,L_{loc}(t_i, g_i \mid sa_i) \tag{3}\\
\mathcal{L}_{PAL} &= \mathcal{L}_{FSL} + \lambda\,\mathcal{L}_{SSL} \tag{4}
\end{align}
$$

where $p_i$ and $p_i^*$ denote the predicted and ground-truth class labels, $t_i$ and $g_i$ the predicted and ground-truth box parameters, and $\beta$ balances the confidence and localization terms.

This progressive supervision enables the backbone to specialize: first-shot for finer localization of small faces, second-shot for semantic robustness in medium to large faces.
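
A toy NumPy sketch of Eqs. (2)–(4), using softmax cross-entropy for $L_{conf}$ and smooth-L1 for $L_{loc}$; anchor offset encoding is omitted, so raw offsets stand in for $t_i$ and $g_i$:

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth-L1, summed over box coordinates."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5).sum(axis=-1)

def shot_loss(conf_logits, labels, loc_pred, loc_gt, beta=1.0):
    """One shot's SSD-style loss: mean confidence cross-entropy plus a
    beta-weighted localization term over positive anchors only."""
    e = np.exp(conf_logits - conf_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    pos = labels.astype(float)          # 1 for face anchors, 0 otherwise
    loc = (pos * smooth_l1(loc_pred - loc_gt)).sum() / max(pos.sum(), 1.0)
    return ce.mean() + beta * loc

def pal_loss(first_shot, second_shot, lam=1.0):
    """L_PAL = L_FSL + lambda * L_SSL (Eq. 4); lambda = 1 in practice."""
    return shot_loss(*first_shot) + lam * shot_loss(*second_shot)

rng = np.random.default_rng(1)
batch = (rng.normal(size=(8, 2)), rng.integers(0, 2, 8),
         rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
print(pal_loss(batch, batch) > 0)   # True
```

In training, `first_shot` would hold the first-shot head's outputs matched against small anchors $sa_i$, and `second_shot` the enhanced head's outputs matched against $a_i$.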

4. Improved Anchor Matching (IAM)

4.1 Motivation and Core Procedure

Conventional anchor assignment based solely on IoU thresholds inadequately samples small faces, resulting in insufficient positive anchors. IAM synchronizes data augmentation with anchor scale assignment to maximize the ratio of positive anchor matches per face, crucial for learning robust regressors.

With probability $0.4$ for each training sample:

  • Anchor-based sampling is performed: a ground-truth face of size $S_{face}$ is chosen, a target anchor scale $S_{anchor} \in \{16, 32, 64, 128, 256, 512\}$ is uniformly sampled, and a sub-image is cropped so that the resized face matches this anchor scale. The crop is resized to $640 \times 640$. With the remaining $0.6$ probability, standard SSD augmentations (random crop/flip/color distortion) are applied instead.
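
The sampling branch can be sketched as follows; crop placement and the exact form of the standard SSD branch are elided, and `anchor_based_resize`/`augment` are illustrative names, not DSFD code:

```python
import random

ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]

def anchor_based_resize(face_size, rng):
    """Uniformly sample a target anchor scale and return the image
    resize factor that makes the face match it."""
    target = rng.choice(ANCHOR_SCALES)
    return target / face_size

def augment(face_size, rng):
    """Prob 0.4: anchor-based sampling; prob 0.6: standard SSD aug."""
    if rng.random() < 0.4:
        return ("anchor_sampling", anchor_based_resize(face_size, rng))
    return ("ssd_augmentation", 1.0)

print(augment(40, random.Random(0)))   # ('ssd_augmentation', 1.0)
```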

Anchor-to-ground-truth assignments are:

  • Positive: $\text{IoU} \ge 0.4$
  • Negative: $\text{IoU} \le 0.3$

This approach increases the mean number of matched anchors per face (from $\approx 6.4$ to $\approx 6.9$), reduces anchor/face scale mismatch, and accelerates the convergence and stability of bounding-box regression.
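
A minimal NumPy sketch of the assignment rule with the thresholds above (anchors between the two thresholds are ignored during training):

```python
import numpy as np

def iou(anchors, gts):
    """Pairwise IoU between (N, 4) anchors and (M, 4) boxes, x1y1x2y2."""
    x1 = np.maximum(anchors[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def assign(anchors, gts, pos_thr=0.4, neg_thr=0.3):
    """Return +1 (positive), -1 (negative), or 0 (ignored) per anchor."""
    best = iou(anchors, gts).max(axis=1)
    return np.where(best >= pos_thr, 1, np.where(best <= neg_thr, -1, 0))

anchors = np.array([[0, 0, 16, 16], [0, 0, 64, 64], [100, 100, 132, 132]], float)
faces = np.array([[0, 0, 20, 20]], float)
print(assign(anchors, faces))   # [ 1 -1 -1]
```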

5. Training Protocol and Implementation

5.1 Anchor, Augmentation, and Hyperparameter Specification

Anchors are assigned across six feature map levels ($l = 1$ to $6$), with corresponding strides of $\{4, 8, 16, 32, 64, 128\}$ for an input size of $640 \times 640$ pixels, producing feature maps of sizes $\{160^2, 80^2, 40^2, 20^2, 10^2, 5^2\}$. Anchor scales:

  • Second-shot ($a_l$): $\{16, 32, 64, 128, 256, 512\}$ pixels
  • First-shot ($sa_l$): $0.5\,a_l$

All anchors use a $1.5:1$ aspect ratio based on face dataset statistics.
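
Anchor generation under this specification can be sketched as follows; whether the $1.5{:}1$ (h:w) ratio preserves anchor area, as assumed here, or simply sets the height to $1.5\times$ the scale is not specified above:

```python
import math

STRIDES = [4, 8, 16, 32, 64, 128]
SCALES = [16, 32, 64, 128, 256, 512]     # second-shot scales a_l

def level_anchors(level, input_size=640, first_shot=False):
    """Center-form anchors (cx, cy, w, h) for one feature level;
    the first shot halves the scale (sa_l = 0.5 * a_l)."""
    stride, scale = STRIDES[level], SCALES[level]
    if first_shot:
        scale /= 2
    w = scale / math.sqrt(1.5)           # area-preserving 1.5:1 ratio
    h = 1.5 * w
    n = input_size // stride
    return [((i + 0.5) * stride, (j + 0.5) * stride, w, h)
            for j in range(n) for i in range(n)]

print(len(level_anchors(0)))   # 25600 anchors at stride 4 (160 x 160)
```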

Data augmentation applies a blend of photometric distortions, random horizontal flip, anchor-based sampling, and standard SSD random crops.

Optimization uses SGD with momentum $0.9$, weight decay $5 \times 10^{-4}$, and a learning rate schedule of $1 \times 10^{-3}$ (first $40k$ steps), $1 \times 10^{-4}$ (next $10k$), and $1 \times 10^{-5}$ (final $10k$). Batch size is $16$ over $4$ GPUs ($\approx 60k$ total steps). Backbones are ImageNet-pretrained (VGG16, ResNet-50/101/152), with new layers initialized by Xavier.
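
The piecewise learning-rate schedule, expressed as a small helper:

```python
def learning_rate(step):
    """1e-3 for the first 40k steps, 1e-4 for the next 10k, 1e-5 after."""
    if step < 40_000:
        return 1e-3
    if step < 50_000:
        return 1e-4
    return 1e-5

print(learning_rate(0), learning_rate(45_000), learning_rate(55_000))
# 0.001 0.0001 1e-05
```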

6. Experimental Evaluation

6.1 WIDER FACE and FDDB Metrics

DSFD outperforms contemporaneous detectors (e.g., PyramidBox, SRN, SSH, S³FD), exhibiting the following performance:

  • WIDER FACE val (AP, %): Easy 96.6, Medium 95.7, Hard 90.4
  • WIDER FACE test (AP, %): Easy 96.0, Medium 95.3, Hard 90.0

On the FDDB benchmark (recall at $1000$ false positives):

  • Discrete ROC: $99.1\%$
  • Continuous ROC: $86.2\%$

6.2 Ablation Analysis

Ablation studies validate each module's contribution:

  • FEM (on VGG16-FSSD baseline): +0.4% (easy), +1.2% (medium), +5.5% (hard)
  • PAL (on ResNet-50+FEM): +0.3% (easy), +0.3% (medium), +0.6% (hard)
  • IAM (on ResNet-101+FEM): +0.3% (easy), +0.1% (medium), +0.3% (hard)

6.3 Inference Speed

DSFD achieves $\approx 22$ FPS (ResNet-50, VGA resolution, NVIDIA P40), with negligible runtime overhead compared to a single-shot SSD, as only the second-shot head is used at inference.

7. Practical Considerations and Deployment

7.1 Reproducibility and Code Access

Canonical code and pretrained models for DSFD are located at https://github.com/TencentYoutuResearch/FaceDetection-DSFD. Training is conducted solely on the WIDER FACE dataset following the specified schedule and augmentations.

7.2 Backbone Selection

Deeper backbones (SE-ResNeXt101, DPN-98) offer incremental AP improvements but increase inference cost; DSFD remains backbone-agnostic and supports trade-offs for real-time scenarios (e.g., ResNet-50 or MobileNet).

7.3 Deployment Guidance

For inference, the first-shot head is omitted. Image pyramids ($0.5\times$–$2.0\times$ input scale) are advised in cases of extreme face-scale variability. Quantization or TensorRT conversion can further accelerate deployment.
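
A hedged sketch of image-pyramid inference: `detect` is a placeholder for the second-shot detector, and the cross-scale NMS that would normally follow is omitted:

```python
def pyramid_detect(detect, scales=(0.5, 1.0, 2.0)):
    """Run `detect` at each pyramid scale and map the resulting
    (x1, y1, x2, y2, score) boxes back to original-image coordinates."""
    all_boxes = []
    for s in scales:
        for (x1, y1, x2, y2, score) in detect(s):
            all_boxes.append((x1 / s, y1 / s, x2 / s, y2 / s, score))
    return all_boxes   # in practice, followed by NMS across scales

# Toy detector that returns one fixed box regardless of scale.
boxes = pyramid_detect(lambda s: [(10, 10, 50, 50, 0.9)])
print(len(boxes))   # 3
```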


DSFD’s architectural, supervisory, and augmentation innovations enable accurate detection over a wide range of face scales, rendering it an effective baseline for face detection research and application (Li et al., 2018).

References

  • Li, J., Wang, Y., Wang, C., Tai, Y., Qian, J., Yang, J., Wang, C., Li, J., Huang, F. (2018). DSFD: Dual Shot Face Detector. arXiv:1810.10220.