DSFD: Dual Shot Face Detector
- The paper introduces DSFD, a dual-shot face detector that trains two detection "shots" over a shared backbone to improve small and multi-scale face detection.
- It leverages key innovations—Feature Enhance Module (FEM), Progressive Anchor Loss (PAL), and Improved Anchor Matching (IAM)—to boost feature learning and anchor assignment.
- Extensive experiments on WIDER FACE and FDDB benchmarks validate DSFD's high accuracy and efficient deployment in real-world face detection scenarios.
The Dual Shot Face Detector (DSFD) is an advanced face detection framework that achieves superior accuracy by addressing fundamental challenges in feature learning, supervision across scales, and anchor assignment grounded in data augmentation. DSFD extends the standard single-shot detection paradigm into a two-stage ("dual shot") training procedure over the same backbone, enhancing detection of faces across a broad range of scales while retaining single-shot inference cost. Its core innovations—Feature Enhance Module (FEM), Progressive Anchor Loss (PAL), and Improved Anchor Matching (IAM)—collectively allow DSFD to outperform prior state-of-the-art detectors on the WIDER FACE and FDDB benchmarks (Li et al., 2018).
1. Architectural Structure and Dual-Shot Paradigm
DSFD extends single-stage detectors like SSD into a dual-shot pipeline using a shared convolutional backbone (e.g., VGG16 or ResNet). The architecture comprises:
- First Shot: Utilizes the original feature maps $of_1, \dots, of_6$ extracted from six backbone layers (conv3_3, conv4_3, conv5_3, fc7, conv6_2, conv7_2 for VGG16). A detection head, structurally similar to SSD's, operates with smaller anchors to supervise the early layers toward high-resolution, small-scale face localization.
- Feature Enhance Module (FEM): Processes each $of_i$ (together with the up-sampled $of_{i+1}$) to produce a spatially aligned enhanced feature map $ef_i$, capturing multi-scale context over a large receptive field (see Section 2).
- Second Shot: Another SSD-style head operates over $ef_1, \dots, ef_6$ with anchor sizes $a_i$ (where $a_i = 2\,sa_i$, twice the first-shot anchors). Only this head's outputs are used during inference, so the dual-shot supervision incurs no extra time cost per forward pass.
The overall data flow can be conceptualized as follows:
| Step | Inputs | Operation/Output |
|---|---|---|
| Backbone | input image | original features $of_1, \dots, of_6$ |
| First Shot Head | $of_i$, small anchors $sa_i$ | early-scale loss supervision (training only) |
| FEM | $of_i$, up-sampled $of_{i+1}$ | enhanced features $ef_1, \dots, ef_6$ |
| Second Shot Head | $ef_i$, anchors $a_i$ | detection/prediction (output) |
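To make this flow concrete, here is a minimal PyTorch-style sketch; the module names (`backbone`, `fem_blocks`, `first_shot_head`, `second_shot_head`) are illustrative placeholders rather than the authors' implementation.

```python
import torch.nn as nn

class DualShotDetector(nn.Module):
    """Minimal sketch of the DSFD data flow; module names are placeholders."""

    def __init__(self, backbone, fem_blocks, first_shot_head, second_shot_head):
        super().__init__()
        self.backbone = backbone                      # yields [of_1, ..., of_6]
        self.fem_blocks = nn.ModuleList(fem_blocks)   # one FEM per pyramid level
        self.first_shot_head = first_shot_head        # small anchors sa_i (training only)
        self.second_shot_head = second_shot_head      # anchors a_i (used at inference)

    def forward(self, images):
        of = self.backbone(images)  # original feature maps from six layers

        # FEM enhances each level using the current map and the next-higher one.
        ef = []
        for i, fem in enumerate(self.fem_blocks):
            upper = of[i + 1] if i + 1 < len(of) else None  # of_6 has no upper level
            ef.append(fem(of[i], upper))

        if self.training:
            # Both shots are supervised during training (Progressive Anchor Loss).
            return self.first_shot_head(of), self.second_shot_head(ef)
        # Inference uses only the second-shot predictions.
        return self.second_shot_head(ef)
```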
2. Feature Enhance Module (FEM)
2.1 Motivation
Standard FPNs fuse high- and low-level features via lateral connections, while RFB blocks expand context through dilated convolutions; neither combines both strategies. FEM merges multi-level context aggregation with enlarged receptive fields, thus enhancing semantic richness and spatial coverage in the feature maps.
2.2 Structure and Mathematical Formulation
For each pyramid level $i$:
- Apply a $1 \times 1$ convolution to normalize the feature map $of_i$.
- Bilinearly up-sample $of_{i+1}$ to match $of_i$'s spatial size, then take the element-wise product with the normalized $of_i$.
- Split the resulting tensor into three channel-wise parts; each passes through a different number of stacked $3 \times 3$ dilated convolutions, giving the branches progressively larger receptive fields.
- Concatenate the branch outputs along the channel axis, yielding the enhanced feature $ef_i$.
Explicitly:

$ef_i = \operatorname{concat}\big[\,B_1(h_i^{(1)}),\; B_2(h_i^{(2)}),\; B_3(h_i^{(3)})\,\big], \qquad h_i = f_{1\times1}(of_i) \otimes \operatorname{up}(of_{i+1})$

where $\operatorname{up}(\cdot)$ denotes bilinear up-sampling, $\otimes$ is the element-wise product, $h_i^{(k)}$ is the $k$-th channel-wise part of $h_i$, and $B_1, B_2, B_3$ are the parallel dilated-convolution branches.
This design generalizes standard SSD features by giving each layer a semantically parallel "shot" with a larger receptive field.
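A compact PyTorch sketch of such a module follows; the shared channel width across levels, the single dilation rate, and the 1/2/3-layer branch depths are assumptions chosen to mirror the description above, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEM(nn.Module):
    """Sketch of a Feature Enhance Module; assumes both input maps share
    the same channel width. Dilation rate and branch depths are illustrative."""

    def __init__(self, channels, dilation=3):
        super().__init__()
        assert channels % 3 == 0, "channels must split into three equal parts"
        self.norm_cur = nn.Conv2d(channels, channels, kernel_size=1)  # normalize of_i
        self.norm_up = nn.Conv2d(channels, channels, kernel_size=1)   # normalize of_{i+1}
        part = channels // 3

        def dconv():  # 3x3 dilated conv that preserves spatial size
            return nn.Conv2d(part, part, 3, padding=dilation, dilation=dilation)

        self.branch1 = nn.Sequential(dconv(), nn.ReLU(True))
        self.branch2 = nn.Sequential(dconv(), nn.ReLU(True), dconv(), nn.ReLU(True))
        self.branch3 = nn.Sequential(dconv(), nn.ReLU(True), dconv(), nn.ReLU(True),
                                     dconv(), nn.ReLU(True))

    def forward(self, of_cur, of_up=None):
        h = self.norm_cur(of_cur)
        if of_up is not None:  # the top pyramid level has no higher map to fuse
            up = F.interpolate(self.norm_up(of_up), size=of_cur.shape[-2:],
                               mode="bilinear", align_corners=False)
            h = h * up  # element-wise product of the spatially aligned maps
        p1, p2, p3 = torch.split(h, h.shape[1] // 3, dim=1)
        return torch.cat([self.branch1(p1), self.branch2(p2), self.branch3(p3)], dim=1)
```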
3. Progressive Anchor Loss (PAL)
3.1 Rationale and Implementation
PAL leverages the intuition that lower-level (earlier) features are optimal for small-object localization, while higher-level (or enhanced) features are advantageous for larger objects due to greater semantic aggregation.
- First Shot Loss (FSL): Supervises the original features $of_i$ with small anchors $sa_i$.
- Second Shot Loss (SSL): Supervises the enhanced features $ef_i$ with the standard (larger) anchors $a_i$.
- Combined Loss: The two terms are combined with a balancing coefficient $\lambda$ (set to $1$ in practice):
$\begin{align*} \mathcal{L}_{SSL} &= \frac{1}{N_{conf}}\sum_i L_{conf}(p_i,p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^*\,L_{loc}(t_i,g_i \mid a_i) \tag{2}\\ \mathcal{L}_{FSL} &= \frac{1}{N_{conf}}\sum_i L_{conf}(p_i,p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^*\,L_{loc}(t_i,g_i \mid sa_i) \tag{3}\\ \mathcal{L}_{PAL} &= \mathcal{L}_{FSL} + \lambda\,\mathcal{L}_{SSL} \tag{4} \end{align*}$
This progressive supervision enables the backbone to specialize: first-shot for finer localization of small faces, second-shot for semantic robustness in medium to large faces.
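The combination in Eqs. (2)–(4) can be sketched as follows in PyTorch; the target layout, the plain cross-entropy (real SSD-style training also applies hard negative mining), and the helper names are assumptions for illustration.

```python
import torch.nn.functional as F

def progressive_anchor_loss(first_out, second_out, first_targets, second_targets,
                            beta=1.0, lam=1.0):
    """Sketch of the PAL combination in Eqs. (2)-(4). Each *_out is
    (class_logits, box_preds); each *_targets is (labels, box_targets)."""

    def shot_loss(cls_logits, box_preds, labels, box_targets):
        pos = labels > 0                    # p_i^* gates the regression term
        n_conf = max(labels.numel(), 1)
        n_loc = max(int(pos.sum()), 1)
        l_conf = F.cross_entropy(cls_logits, labels, reduction="sum") / n_conf
        l_loc = F.smooth_l1_loss(box_preds[pos], box_targets[pos],
                                 reduction="sum") / n_loc
        return l_conf + beta * l_loc

    l_fsl = shot_loss(*first_out, *first_targets)    # small anchors sa_i on of_i
    l_ssl = shot_loss(*second_out, *second_targets)  # anchors a_i on ef_i
    return l_fsl + lam * l_ssl                       # Eq. (4)
```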
4. Improved Anchor Matching (IAM)
4.1 Motivation and Core Procedure
Conventional anchor assignment based solely on IoU thresholds inadequately samples small faces, resulting in insufficient positive anchors. IAM synchronizes data augmentation with anchor scale assignment to maximize the ratio of positive anchor matches per face, crucial for learning robust regressors.
With probability $0.4$ for each training sample:
- Anchor-based sampling is performed: a ground-truth face is chosen, a target anchor scale is uniformly sampled from the anchor-scale set, and a sub-image is cropped so that the resized face matches this anchor scale; the crop is then resized to the $640 \times 640$ training resolution. With the remaining $0.6$ probability, standard SSD-style augmentations (random crop, flip, color distortion) are applied.
Anchor-to-ground-truth assignments follow the usual IoU rule:
- Positive: anchors whose IoU with some ground-truth face exceeds the matching threshold.
- Negative: anchors whose IoU with every face falls below it; the remainder are ignored.
This approach increases the mean number of matched anchors per face, reduces anchor/face scale mismatch, and accelerates the convergence and stability of bounding-box regression.
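A minimal NumPy sketch of the IoU-threshold matching rule follows; the threshold values are left as parameters because the exact cutoffs are implementation details not specified here.

```python
import numpy as np

def iou_matrix(anchors, faces):
    """Pairwise IoU between anchors (A,4) and faces (F,4) in x1,y1,x2,y2 format."""
    x1 = np.maximum(anchors[:, None, 0], faces[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], faces[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], faces[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], faces[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_f = (faces[:, 2] - faces[:, 0]) * (faces[:, 3] - faces[:, 1])
    return inter / (area_a[:, None] + area_f[None, :] - inter)

def match_anchors(anchors, faces, pos_thresh, neg_thresh):
    """IoU-threshold assignment: 1 = positive, 0 = negative, -1 = ignored."""
    iou = iou_matrix(anchors, faces)
    best_face = iou.argmax(axis=1)       # best-matching face per anchor
    best_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1)
    labels[best_iou < neg_thresh] = 0    # background
    labels[best_iou >= pos_thresh] = 1   # positive match
    return labels, best_face
```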
5. Training Protocol and Implementation
5.1 Anchor, Augmentation, and Hyperparameter Specification
Anchors are assigned across six feature-map levels ($i = 1$ to $6$), with corresponding strides of $4, 8, 16, 32, 64, 128$ for a $640 \times 640$ input, producing feature maps of size $160, 80, 40, 20, 10, 5$. Anchor scales:
- Second-shot ($a_i$): $16, 32, 64, 128, 256, 512$ pixels
- First-shot ($sa_i = a_i / 2$): $8, 16, 32, 64, 128, 256$ pixels
All anchors use a $1.5:1$ aspect ratio based on face dataset statistics.
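This layout can be generated with a few lines of Python, as sketched below; treating the scale as the square-root of the anchor area and splitting it by the $1.5{:}1$ height-to-width ratio is an assumption of this sketch.

```python
import math

def make_anchors(fmap_size, stride, scale, ratio=1.5):
    """One anchor per cell at a pyramid level (sketch). The anchor has area
    scale**2 with height/width = ratio; boxes are (x1, y1, x2, y2)."""
    w = scale / math.sqrt(ratio)
    h = scale * math.sqrt(ratio)
    anchors = []
    for y in range(fmap_size):
        for x in range(fmap_size):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Second-shot anchors for a 640x640 input, one scale per level:
levels = zip([160, 80, 40, 20, 10, 5],       # feature-map sizes
             [4, 8, 16, 32, 64, 128],        # strides
             [16, 32, 64, 128, 256, 512])    # anchor scales a_i
all_anchors = [a for size, stride, scale in levels
               for a in make_anchors(size, stride, scale)]
```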
Data augmentation applies a blend of photometric distortions, random horizontal flip, anchor-based sampling, and standard SSD random crops.
Optimization uses SGD with momentum $0.9$, weight decay $5 \times 10^{-4}$, and a learning-rate schedule of $10^{-3}$ (first $40k$ steps), $10^{-4}$ (next $10k$), and $10^{-5}$ (final $10k$). Batch size is $16$ over $4$ GPUs ($60k$ total steps). Backbones are ImageNet-pretrained (VGG16, ResNet-50/101/152), with new layers initialized by Xavier initialization.
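A minimal PyTorch sketch of this schedule follows, using a stand-in model and a dummy objective so it runs self-contained; the milestone placement mirrors the step counts above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3, padding=1)  # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40_000, 50_000], gamma=0.1)  # 1e-3 -> 1e-4 -> 1e-5

for step in range(60_000):
    x = torch.randn(2, 3, 64, 64)      # dummy batch standing in for 640x640 crops
    loss = model(x).pow(2).mean()      # dummy objective; real training uses PAL
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                   # step-based schedule, advanced per iteration
```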
6. Experimental Evaluation
6.1 WIDER FACE and FDDB Metrics
DSFD outperforms contemporaneous detectors (e.g., PyramidBox, SRN, SSH, S³FD), exhibiting the following performance:
| Dataset-Split | Easy | Medium | Hard |
|---|---|---|---|
| WIDER-val (AP, %) | 96.6 | 95.7 | 90.4 |
| WIDER-test (AP, %) | 96.0 | 95.3 | 90.0 |
On the FDDB benchmark (recall at $1{,}000$ false positives), DSFD achieves state-of-the-art results on both the discrete and continuous ROC curves.
6.2 Ablation Analysis
Ablation studies validate each module's contribution:
- FEM (on VGG16-FSSD baseline): +0.4% (easy), +1.2% (medium), +5.5% (hard)
- PAL (on ResNet-50+FEM): +0.3% (easy), +0.3% (medium), +0.6% (hard)
- IAM (on ResNet-101+FEM): +0.3% (easy), +0.1% (medium), +0.3% (hard)
6.3 Inference Speed
DSFD runs at practical frame rates (ResNet-50 backbone, VGA resolution, NVIDIA P40 GPU), with negligible runtime overhead compared to a single-shot SSD, as only the second-shot head is used at inference.
7. Practical Considerations and Deployment
7.1 Reproducibility and Code Access
Canonical code and pretrained models for DSFD are located at https://github.com/TencentYoutuResearch/FaceDetection-DSFD. Training is conducted solely on the WIDER FACE dataset following the specified schedule and augmentations.
7.2 Backbone Selection
Deeper backbones (SE-ResNeXt101, DPN-98) offer incremental AP improvements but increase inference cost; DSFD remains backbone-agnostic and supports trade-offs for real-time scenarios (e.g., ResNet-50 or MobileNet).
7.3 Deployment Guidance
For inference, the first-shot head is omitted. Image pyramids (scaling the input by factors from $0.5\times$ upward) are advised in cases of extreme face-scale variability. Quantization or TensorRT conversion can further accelerate deployment.
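A hedged sketch of such pyramid testing is shown below: `detect_fn` and `resize_fn` are caller-supplied placeholders, and the scale set and NMS threshold are illustrative defaults, not values from the paper.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over (N,4) x1,y1,x2,y2 boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return np.array(keep, dtype=int)

def pyramid_detect(detect_fn, resize_fn, image, scales=(0.5, 1.0, 1.5, 2.0)):
    """Run the detector over an image pyramid and merge results.
    detect_fn(image) -> (boxes, scores) with boxes in pixel coordinates."""
    boxes_list, scores_list = [], []
    for s in scales:
        boxes, scores = detect_fn(resize_fn(image, s))
        boxes_list.append(np.asarray(boxes) / s)   # map back to original coordinates
        scores_list.append(np.asarray(scores))
    boxes = np.concatenate(boxes_list)
    scores = np.concatenate(scores_list)
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]
```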
DSFD’s architectural, supervisory, and augmentation innovations enable accurate detection over a wide range of face scales, rendering it an effective baseline for face detection research and application (Li et al., 2018).