Progressive Anchor Loss (PAL)
- Progressive Anchor Loss (PAL) is a dual-shot loss formulation that enhances face detection by using small and large anchor sets to supervise different feature maps.
- PAL applies two parallel multi-task losses with distinct anchor scales to improve the detection of small and occluded faces without adding inference time.
- Empirical evaluations on the WIDER FACE dataset show that integrating PAL into DSFD models consistently boosts accuracy, especially on challenging subsets.
Progressive Anchor Loss (PAL) is a multi-shot loss formulation introduced in the Dual Shot Face Detector (DSFD) framework to improve face detection performance by progressively supervising features with two distinct sets of anchor boxes. PAL employs a dual-branch design that applies the standard detection losses to both a set of small anchors on earlier feature maps and a set of larger anchors on enhanced feature maps, combining their losses to promote robust feature learning across object scales. The resulting technique enhances detection, especially of small and hard-to-detect faces, without incurring inference-time overhead (Li et al., 2018).
1. Formal Definition and Mathematical Formulation
Let $i$ index all anchors in a mini-batch. For each anchor $i$:
- $p_i$: predicted face probability,
- $p_i^\ast$: ground-truth label (1 = positive, 0 = negative),
- $t_i$: predicted bounding box regression offsets,
- $g_i$: ground-truth box regression offsets,
- $a_i$: "second-shot" (larger) anchor box parameters,
- $sa_i$: "first-shot" (smaller) anchor box parameters.
For each anchor, the classification loss is the two-class softmax cross-entropy, and the regression loss is the Smooth L1 distance, parameterized by the anchor reference.
Two loss terms are defined:
- First Shot Loss (FSL; small anchors $sa_i$):

$$\mathcal{L}_{FSL} = \frac{1}{N_{conf}} \sum_i L_{conf}(p_i, p_i^\ast) + \frac{\beta}{N_{loc}} \sum_i p_i^\ast \, L_{loc}(t_i, g_i; sa_i)$$

- Second Shot Loss (SSL; large anchors $a_i$):

$$\mathcal{L}_{SSL} = \frac{1}{N_{conf}} \sum_i L_{conf}(p_i, p_i^\ast) + \frac{\beta}{N_{loc}} \sum_i p_i^\ast \, L_{loc}(t_i, g_i; a_i)$$

The Progressive Anchor Loss is:

$$\mathcal{L}_{PAL} = \mathcal{L}_{SSL} + \lambda \, \mathcal{L}_{FSL}$$

where:
- $N_{conf}$: number of anchors (positive and negative) used for classification normalization,
- $N_{loc}$: number of positive anchors for regression normalization,
- $\beta$: classification/regression balancing factor (typically 1),
- $\lambda$: weighting between the two shots (typically 1).
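As a minimal NumPy sketch of the formulation above (the function and argument names are illustrative, not the DSFD reference implementation; both shots share the same labels $p_i^\ast$ but use regression targets encoded against their own anchor set):

```python
import numpy as np

def softmax_ce(logits, labels):
    # Two-class softmax cross-entropy per anchor.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels]

def smooth_l1(pred, target):
    # Smooth L1 summed over the 4 box coordinates.
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d * d, d - 0.5).sum(axis=1)

def pal_loss(logits_fs, t_fs, g_fs, logits_ss, t_ss, g_ss, labels,
             beta=1.0, lam=1.0):
    """Progressive Anchor Loss: L_SSL + lam * L_FSL.

    labels: ground-truth labels p_i* (1 = face, 0 = background);
    *_fs / *_ss: first-shot (small-anchor) and second-shot (large-anchor)
    predictions and regression targets, encoded w.r.t. sa_i / a_i.
    """
    n_conf = len(labels)                      # N_conf: all anchors
    n_loc = max(int(labels.sum()), 1)         # N_loc: positive anchors
    pos = labels.astype(bool)

    def shot_loss(logits, t, g):
        cls = softmax_ce(logits, labels).sum() / n_conf
        reg = smooth_l1(t[pos], g[pos]).sum() * beta / n_loc
        return cls + reg

    return shot_loss(logits_ss, t_ss, g_ss) + lam * shot_loss(logits_fs, t_fs, g_fs)
```

With identical inputs for both shots and $\lambda = 1$, the result is exactly twice the single-shot loss, which is a convenient sanity check.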
2. Dual Anchor Set Construction
DSFD divides anchor supervision into two parallel “shots” on each feature map cell:
| Shot | Anchor Set | Anchor Scale (at stride $s$) | Target Feature Map |
|---|---|---|---|
| First Shot | $sa_i$ | Half standard scale ($2s$) | Original feature maps |
| Second Shot | $a_i$ | Standard scale ($4s$) | Enhanced (FEM) feature maps |
For a typical 640×640 image, six feature maps of strides $\{4, 8, 16, 32, 64, 128\}$ are used. At each location:
- Second-shot anchors: $\{16, 32, 64, 128, 256, 512\}$ pixels,
- First-shot anchors: $\{8, 16, 32, 64, 128, 256\}$ pixels (exactly half).
All anchors have an aspect ratio of 1.5:1, and each cell yields one anchor per shot, resulting in approximately 34k anchors per shot per image.
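This construction can be sketched directly (assuming one anchor per cell, the stride/scale relation above, and a simple `(cx, cy, w, h)` parameterization with height = 1.5 × width; `build_anchors` and its defaults are illustrative):

```python
def build_anchors(image_size=640,
                  strides=(4, 8, 16, 32, 64, 128),
                  scale_per_stride=4,    # second shot: anchor width = 4x stride
                  aspect_ratio=1.5):     # height : width = 1.5 : 1
    """Generate one first-shot and one second-shot anchor per cell,
    each as (cx, cy, w, h). First-shot anchors are exactly half size."""
    first, second = [], []
    for s in strides:
        cells = image_size // s
        big = s * scale_per_stride       # 16..512 px across the six maps
        small = big / 2                  # 8..256 px
        for y in range(cells):
            for x in range(cells):
                cx, cy = (x + 0.5) * s, (y + 0.5) * s
                second.append((cx, cy, big, aspect_ratio * big))
                first.append((cx, cy, small, aspect_ratio * small))
    return first, second
```

Summing the squared grid sizes ($160^2 + 80^2 + 40^2 + 20^2 + 10^2 + 5^2$) gives 34,125 anchors per shot for a 640×640 input.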
3. Loss Construction and Hyperparameters
PAL extends the one-stage detection multi-task loss by duplicating it across two anchor sets while sharing prediction heads but using distinct anchor sizes.
The main hyperparameters are:
- $\beta$ (within-shot classification/regression weight),
- $\lambda$ (between-shot loss combination),
- IoU threshold for anchor matching: $0.4$,
- Aspect ratio: $1.5:1$.
This configuration provides a training signal at both small and large spatial scales. The anchors for each shot are normalized separately, and both losses are summed.
4. Motivation and Curriculum Design
Early feature maps possess high spatial resolution but limited semantic abstraction. By assigning smaller anchors and computing an auxiliary first-shot loss, PAL drives early-stage features to encode fine-grained details, which is beneficial for small face detection. In contrast, deeper feature maps undergo enhancement through the Feature Enhance Module (FEM) and use larger anchors, targeting richer contextual representations suited for larger faces.
The progressive use of anchors—small at early stages, large at later stages—architects a curriculum that mirrors a coarse-to-fine detection paradigm. As a result, PAL facilitates more discriminative feature learning, supports stable convergence, and provides auxiliary supervision analogous to deep supervision strategies.
5. Training and Inference Methodology
PAL is integrated into the training workflow as follows:
- The total loss is $\mathcal{L} = \mathcal{L}_{SSL} + \lambda \, \mathcal{L}_{FSL}$.
- Backpropagation applies to both shots jointly; no masking or sequential updating is used.
- Inference utilizes only the second-shot (large anchor) outputs; all first-shot predictions are ignored, incurring no inference-time cost.
Anchor Matching:
- An anchor is positive if its IoU with a ground-truth face is at least $0.4$, negative otherwise.
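A plain-Python sketch of this matching rule (boxes as `(x1, y1, x2, y2)` corners; the helper names are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_anchors(anchors, gt_boxes, pos_thresh=0.4):
    """Label each anchor 1 (positive) if its best IoU with any
    ground-truth face meets the threshold, else 0 (negative)."""
    labels = []
    for a in anchors:
        best = max((iou(a, g) for g in gt_boxes), default=0.0)
        labels.append(1 if best >= pos_thresh else 0)
    return labels
```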
Data Augmentation (Improved Anchor Matching):
- Probability $0.4$: data-anchor-sampling (as in PyramidBox), cropping around a randomly selected face so that its size matches a randomly chosen anchor scale.
- Probability $0.6$: SSD-style random crop, color jitter, horizontal flip.
Optimization Regimen:
- SGD with momentum $0.9$, weight decay $5 \times 10^{-4}$, batch size $16$.
- Learning rate: $10^{-3}$ for $40$k iterations, $10^{-4}$ for $10$k, $10^{-5}$ for $10$k.
- At test time: select top $5000$ second-shot detections, apply NMS@$0.3$, retain top $750$.
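The test-time post-processing step can be sketched as greedy non-maximum suppression with the thresholds quoted above (this helper is illustrative, not the DSFD codebase):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.3, pre_top_k=5000, post_top_k=750):
    """Greedy NMS: keep top pre_top_k by score, suppress overlaps
    above iou_thresh, return at most post_top_k kept indices."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:pre_top_k]
    keep = []
    while order and len(keep) < post_top_k:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```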
6. Empirical Results and Ablation Studies
Empirical evaluation on the WIDER FACE validation set demonstrates PAL's quantitative benefit:
| Configuration | Easy | Medium | Hard |
|---|---|---|---|
| Res50 + FEM, no PAL | 95.0 | 94.1 | 88.0 |
| + PAL (add first-shot supervision) | 95.3 (+0.3) | 94.4 (+0.3) | 88.6 (+0.6) |
| Res101 + FEM + IAM, no PAL | 95.8 | 95.1 | 89.7 |
| Res101 + FEM + IAM + PAL | 96.3 (+0.5) | 95.4 (+0.3) | 90.1 (+0.4) |
These improvements are consistent across feature extractors and reinforce PAL's effect, particularly on the Hard subset, which is characterized by small and occluded faces (Li et al., 2018).
7. Implementation Considerations and Generalization
For transfer to generic single-shot detection architectures (e.g., SSD, RetinaNet), PAL can be implemented by:
- Duplicating detection heads per feature map.
- Assigning “small” and “large” anchor sets to each head.
- Computing two parallel multi-task losses and summing them, optionally balancing with $\lambda$.
- At inference, retaining only the large-anchor predictions.
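The wiring for such a transfer amounts to a small wrapper around any existing single-shot detection loss (here `loss_fn` is a hypothetical interface standing in for the host detector's multi-task loss):

```python
def dual_shot_loss(loss_fn, small_preds, small_targets,
                   large_preds, large_targets, lam=1.0):
    """Sum the two shots' losses, PAL-style: the large-anchor (second-shot)
    loss plus lam times the small-anchor (first-shot) auxiliary loss.
    loss_fn is any single-shot detection loss callable."""
    return loss_fn(large_preds, large_targets) + lam * loss_fn(small_preds, small_targets)
```

At inference time, the small-anchor head and its loss are simply dropped, so the wrapper affects training only.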
Key practical points:
- Training cost increases modestly due to doubled anchor count and related matching/loss computation.
- No inference-time overhead exists, as auxiliary (small-anchor) heads are omitted post-training.
- Hyperparameters, particularly anchor size and aspect ratio, require adaptation for new domains if object scales and shapes diverge from face detection; extensive tuning is likely needed for generic object detection tasks.
Progressive Anchor Loss thus provides a straightforward yet effective extension of one-stage detectors by integrating scale-structured supervision, delivering accuracy gains for small-object detection and robust feature discrimination with minimal additional complexity (Li et al., 2018).