Progressive Anchor Loss (PAL)
- Progressive Anchor Loss (PAL) is a dual-shot loss formulation that enhances face detection by using small and large anchor sets to supervise different feature maps.
- PAL applies two parallel multi-task losses with distinct anchor scales to improve the detection of small and occluded faces without adding inference time.
- Empirical evaluations on the WIDER FACE dataset show that integrating PAL into DSFD models consistently boosts accuracy, especially on challenging subsets.
Progressive Anchor Loss (PAL) is a multi-shot loss formulation introduced in the Dual Shot Face Detector (DSFD) framework to improve face detection performance by progressively supervising features with two distinct sets of anchor boxes. PAL employs a dual-branch design that applies the standard detection losses to both a set of small anchors on earlier feature maps and a set of larger anchors on enhanced feature maps, combining their losses to promote robust feature learning across object scales. The resulting technique enhances detection, especially of small and hard-to-detect faces, without incurring inference-time overhead (Li et al., 2018).
1. Formal Definition and Mathematical Formulation
Let $i$ index all anchors in a mini-batch. For each anchor $i$:
- $p_i$: predicted face probability,
- $p_i^\ast$: ground-truth label (1 = positive, 0 = negative),
- $t_i$: predicted bounding box regression offsets,
- $g_i$: ground-truth box regression offsets,
- $a_i$: "second-shot" (larger) anchor box parameters,
- $sa_i$: "first-shot" (smaller) anchor box parameters.
For each anchor, the classification loss is the two-class softmax cross-entropy, and the regression loss is the Smooth L1 distance, parameterized by the anchor reference.
Two loss terms are defined:
- First Shot Loss (FSL; small anchors $sa_i$):

$$\mathcal{L}_{FSL} = \frac{1}{N_{conf}} \sum_i L_{conf}(p_i, p_i^\ast) + \frac{\beta}{N_{loc}} \sum_i p_i^\ast \, L_{loc}(t_i, g_i; sa_i)$$

- Second Shot Loss (SSL; large anchors $a_i$):

$$\mathcal{L}_{SSL} = \frac{1}{N_{conf}} \sum_i L_{conf}(p_i, p_i^\ast) + \frac{\beta}{N_{loc}} \sum_i p_i^\ast \, L_{loc}(t_i, g_i; a_i)$$

The Progressive Anchor Loss is:

$$\mathcal{L}_{PAL} = \mathcal{L}_{SSL} + \lambda \, \mathcal{L}_{FSL}$$

where:
- $N_{conf}$: number of anchors (positive and negative) used for classification normalization,
- $N_{loc}$: number of positive anchors for regression normalization,
- $\beta$: classification/regression balancing factor (typically 1),
- $\lambda$: weighting between the two shots (typically 1).
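As a minimal NumPy sketch of the formulation above (the function and argument names are illustrative, not the DSFD reference implementation; both shots share the same labels $p_i^\ast$ but use regression targets encoded against their own anchor set):

```python
import numpy as np

def softmax_ce(logits, labels):
    # Two-class softmax cross-entropy per anchor.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels]

def smooth_l1(pred, target):
    # Smooth L1 summed over the 4 box coordinates.
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d * d, d - 0.5).sum(axis=1)

def pal_loss(logits_fs, t_fs, g_fs, logits_ss, t_ss, g_ss, labels,
             beta=1.0, lam=1.0):
    """Progressive Anchor Loss: L_SSL + lam * L_FSL.

    labels: ground-truth labels p_i* (1 = face, 0 = background);
    *_fs / *_ss: first-shot (small-anchor) and second-shot (large-anchor)
    predictions and regression targets, encoded w.r.t. sa_i / a_i.
    """
    n_conf = len(labels)                      # N_conf: all anchors
    n_loc = max(int(labels.sum()), 1)         # N_loc: positive anchors
    pos = labels.astype(bool)

    def shot_loss(logits, t, g):
        cls = softmax_ce(logits, labels).sum() / n_conf
        reg = smooth_l1(t[pos], g[pos]).sum() * beta / n_loc
        return cls + reg

    return shot_loss(logits_ss, t_ss, g_ss) + lam * shot_loss(logits_fs, t_fs, g_fs)
```

With identical inputs for both shots and $\lambda = 1$, the result is exactly twice the single-shot loss, which is a convenient sanity check.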
2. Dual Anchor Set Construction
DSFD divides anchor supervision into two parallel “shots” on each feature map cell:
| Shot | Anchor Set | Anchor Scale (at stride $s$) | Target Feature Map |
|---|---|---|---|
| First Shot | $sa_i$ | Half standard scale ($2s$) | Original feature maps |
| Second Shot | $a_i$ | Standard scale ($4s$) | Enhanced (FEM) feature maps |
For a typical 640×640 image, six feature maps of strides $\{4, 8, 16, 32, 64, 128\}$ are used. At each location:
- Second-shot anchors: $\{16, 32, 64, 128, 256, 512\}$ pixels,
- First-shot anchors: $\{8, 16, 32, 64, 128, 256\}$ pixels (exactly half).
All anchors have an aspect ratio of 1.5:1, and each cell yields one anchor per shot, resulting in approximately 34k anchors per shot per image.
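This construction can be sketched directly (assuming one anchor per cell, the stride/scale relation above, and a simple `(cx, cy, w, h)` parameterization with height = 1.5 × width; `build_anchors` and its defaults are illustrative):

```python
def build_anchors(image_size=640,
                  strides=(4, 8, 16, 32, 64, 128),
                  scale_per_stride=4,    # second shot: anchor width = 4x stride
                  aspect_ratio=1.5):     # height : width = 1.5 : 1
    """Generate one first-shot and one second-shot anchor per cell,
    each as (cx, cy, w, h). First-shot anchors are exactly half size."""
    first, second = [], []
    for s in strides:
        cells = image_size // s
        big = s * scale_per_stride       # 16..512 px across the six maps
        small = big / 2                  # 8..256 px
        for y in range(cells):
            for x in range(cells):
                cx, cy = (x + 0.5) * s, (y + 0.5) * s
                second.append((cx, cy, big, aspect_ratio * big))
                first.append((cx, cy, small, aspect_ratio * small))
    return first, second
```

Summing the squared grid sizes ($160^2 + 80^2 + 40^2 + 20^2 + 10^2 + 5^2$) gives 34,125 anchors per shot for a 640×640 input.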
3. Loss Construction and Hyperparameters
PAL extends the one-stage detection multi-task loss by duplicating it across two anchor sets while sharing prediction heads but using distinct anchor sizes.
The main hyperparameters are:
- $\beta$ (within-shot classification/regression weight),
- $\lambda$ (between-shot loss combination),
- IoU threshold for anchor matching: $0.4$,
- Aspect ratio: $1.5:1$.
This configuration provides a training signal at both small and large spatial scales. The anchors for each shot are normalized separately, and both losses are summed.
4. Motivation and Curriculum Design
Early feature maps possess high spatial resolution but limited semantic abstraction. By assigning smaller anchors and computing an auxiliary first-shot loss, PAL drives early-stage features to encode fine-grained details, which is beneficial for small face detection. In contrast, deeper feature maps undergo enhancement through the Feature Enhance Module (FEM) and use larger anchors, targeting richer contextual representations suited for larger faces.
The progressive use of anchors—small at early stages, large at later stages—architects a curriculum that mirrors a coarse-to-fine detection paradigm. As a result, PAL facilitates more discriminative feature learning, supports stable convergence, and provides auxiliary supervision analogous to deep supervision strategies.
5. Training and Inference Methodology
PAL is integrated into the training workflow as follows:
- The total loss is $\mathcal{L} = \mathcal{L}_{SSL} + \lambda \, \mathcal{L}_{FSL}$.
- Backpropagation applies to both shots jointly; no masking or sequential updating is used.
- Inference utilizes only the second-shot (large anchor) outputs; all first-shot predictions are ignored, incurring no inference-time cost.
Anchor Matching:
- An anchor is positive if its IoU with a ground-truth face is at least $0.4$, negative otherwise.
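A plain-Python sketch of this matching rule (boxes as `(x1, y1, x2, y2)` corners; the helper names are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_anchors(anchors, gt_boxes, pos_thresh=0.4):
    """Label each anchor 1 (positive) if its best IoU with any
    ground-truth face meets the threshold, else 0 (negative)."""
    labels = []
    for a in anchors:
        best = max((iou(a, g) for g in gt_boxes), default=0.0)
        labels.append(1 if best >= pos_thresh else 0)
    return labels
```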
Data Augmentation (Improved Anchor Matching):
- Probability $0.4$: data-anchor-sampling (as in PyramidBox), cropping around a randomly selected face so that its size matches a randomly chosen anchor scale.
- Probability $0.6$: SSD-style random crop, color jitter, horizontal flip.
Optimization Regimen:
- SGD with momentum $0.9$, weight decay $5 \times 10^{-4}$, batch size $16$.
- Learning rate: $10^{-3}$ for $40$k iterations, $10^{-4}$ for $10$k, $10^{-5}$ for $10$k.
- At test time: select top $5000$ second-shot detections, apply NMS@$0.3$, retain top $750$.
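The test-time post-processing step can be sketched as greedy non-maximum suppression with the thresholds quoted above (this helper is illustrative, not the DSFD codebase):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.3, pre_top_k=5000, post_top_k=750):
    """Greedy NMS: keep top pre_top_k by score, suppress overlaps
    above iou_thresh, return at most post_top_k kept indices."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:pre_top_k]
    keep = []
    while order and len(keep) < post_top_k:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```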
6. Empirical Results and Ablation Studies
Empirical evaluation on the WIDER FACE validation set demonstrates PAL's quantitative benefit:
| Configuration | Easy | Medium | Hard |
|---|---|---|---|
| Res50 + FEM, no PAL | 95.0 | 94.1 | 88.0 |
| + PAL (add first-shot supervision) | 95.3 (+0.3) | 94.4 (+0.3) | 88.6 (+0.6) |
| Res101 + FEM + IAM, no PAL | 95.8 | 95.1 | 89.7 |
| Res101 + FEM + IAM + PAL | 96.3 (+0.5) | 95.4 (+0.3) | 90.1 (+0.4) |
These improvements are consistent across feature extractors and reinforce PAL's effect, particularly on the Hard subset, which is characterized by small and occluded faces (Li et al., 2018).
7. Implementation Considerations and Generalization
For transfer to generic single-shot detection architectures (e.g., SSD, RetinaNet), PAL can be implemented by:
- Duplicating detection heads per feature map.
- Assigning “small” and “large” anchor sets to each head.
- Computing two parallel multi-task losses and summing them, optionally balancing with $\lambda$.
- At inference, retaining only the large-anchor predictions.
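The wiring for such a transfer amounts to a small wrapper around any existing single-shot detection loss (here `loss_fn` is a hypothetical interface standing in for the host detector's multi-task loss):

```python
def dual_shot_loss(loss_fn, small_preds, small_targets,
                   large_preds, large_targets, lam=1.0):
    """Sum the two shots' losses, PAL-style: the large-anchor (second-shot)
    loss plus lam times the small-anchor (first-shot) auxiliary loss.
    loss_fn is any single-shot detection loss callable."""
    return loss_fn(large_preds, large_targets) + lam * loss_fn(small_preds, small_targets)
```

At inference time, the small-anchor head and its loss are simply dropped, so the wrapper affects training only.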
Key practical points:
- Training cost increases modestly due to doubled anchor count and related matching/loss computation.
- No inference-time overhead exists, as auxiliary (small-anchor) heads are omitted post-training.
- Hyperparameters, particularly anchor size and aspect ratio, require adaptation for new domains if object scales and shapes diverge from face detection; extensive tuning is likely needed for generic object detection tasks.
Progressive Anchor Loss thus provides a straightforward yet effective extension of one-stage detectors by integrating scale-structured supervision, delivering accuracy gains for small-object detection and robust feature discrimination with minimal additional complexity (Li et al., 2018).