SSD Detection Head Architecture

Updated 18 February 2026
  • SSD Detection Head is a fully convolutional module that predicts object class confidences and bounding box offsets on multi-scale feature maps.
  • It uses predefined default boxes with specific scales and aspect ratios to ensure dense coverage for detecting objects of various sizes.
  • Optimizations such as hard negative mining and specialized variants (Tiny SSD, MHD-Net, PDM-SSD) enhance detection efficiency and accuracy across different modalities.

A Single Shot MultiBox Detector (SSD) detection head is a fully convolutional module that simultaneously predicts class confidences and regresses bounding box coordinates for a fixed set of default boxes (anchor boxes) at each spatial location across multiple feature maps of varying resolution. Unlike two-stage detectors that decouple candidate region proposal and classification, the SSD detection head integrates all computations in a single unified forward pass, enabling real-time object detection with high accuracy across diverse object scales (Liu et al., 2015).

1. Architectural Overview of the SSD Detection Head

The detection head in SSD operates on top of a truncated base convolutional network (VGG-16 in the standard SSD300), replacing the dense classification layers (fc6, fc7) with convolutional counterparts to preserve spatial dimensions and facilitate dense predictions. Prediction heads are attached to six progressively downsampled feature maps for SSD300:

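| Feature map | Size (cells) | Depth (channels) | Default boxes per cell (k) |
|---|---|---|---|
| conv4_3 | 38×38 | 512 | 4 |
| conv7 | 19×19 | 1024 | 6 |
| conv8_2 | 10×10 | 512 | 6 |
| conv9_2 | 5×5 | 256 | 6 |
| conv10_2 | 3×3 | 256 | 4 |
| conv11_2 | 1×1 | 256 | 4 |

The maps with k = 4 (conv4_3, conv10_2, conv11_2) omit the aspect ratios 3 and 1/3, which gives 8732 default boxes in total for SSD300.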

On each spatial cell of every feature map, the detection head uses a stack of $3 \times 3$ convolution filters to predict for every default box: (a) $c$ class scores (including background) and (b) 4 offsets parameterizing the bounding box relative to the anchor (Liu et al., 2015).
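The output size of the head follows directly from this description: with $k$ default boxes per cell and $c$ classes, each cell needs $k(c + 4)$ output channels. The sketch below works through that arithmetic. The per-map box counts (4, 6, 6, 6, 4, 4) are the SSD300 configuration reported by Liu et al.; the names `SSD300_MAPS`, `head_output_channels`, and `total_default_boxes` are illustrative, and `num_classes = 21` assumes the 20 VOC classes plus background.

```python
# Feature-map configuration for SSD300: (name, spatial size, boxes per cell k).
# The k values follow the SSD300 setup reported by Liu et al.
SSD300_MAPS = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]


def head_output_channels(k, num_classes):
    """Channels predicted per cell: k boxes, each with c scores and 4 offsets."""
    return k * (num_classes + 4)


def total_default_boxes(maps):
    """Sum of size * size * k over all feature maps."""
    return sum(size * size * k for _, size, k in maps)


num_classes = 21  # assumption: 20 VOC classes + background
for name, size, k in SSD300_MAPS:
    # Shape of the head output for this map as (height, width, channels)
    print(name, (size, size, head_output_channels(k, num_classes)))
print("total default boxes:", total_default_boxes(SSD300_MAPS))
```

In practice the class scores and the offsets are often produced by two separate convolutions per map; the channel count above is their sum.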

2. Default Boxes: Parameterization and Assignment

Default boxes (anchors) are defined per feature cell for each map, parameterized by scale and aspect ratio. The scale for the $k$-th feature map, $s_k$, is linearly interpolated between $s_{min} = 0.2$ and $s_{max} = 0.9$, where $m$ is the number of feature maps used for prediction and $k \in [1, m]$:

$$s_k = s_{min} + \frac{(s_{max} - s_{min}) \cdot (k - 1)}{m - 1}$$

For each aspect ratio $a_r \in \{1, 2, 3, 1/2, 1/3\}$, the normalized width and height are set as:

$$w_k^a = s_k \sqrt{a_r}$$

$$h_k^a = \frac{s_k}{\sqrt{a_r}}$$

An additional box is added for $a_r = 1$, with scale $s'_k = \sqrt{s_k s_{k+1}}$. The default box center at location $(i, j)$ on a map of spatial dimension $|f_k|$ is:

$$(c_x, c_y) = \left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right)$$
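The formulas above can be turned into a small generator of default boxes for one feature map. This is a minimal sketch, not the reference implementation: the function names `scales` and `default_boxes` are illustrative, and the scale $s_{m+1}$ needed for the extra box on the last map is assumed to come from extending the same linear formula by one step.

```python
import math


def scales(m, s_min=0.2, s_max=0.9):
    """Scales s_1..s_m from the linear rule, plus an extrapolated s_{m+1}.

    Index 0 of the returned list holds s_1. The extra entry s_{m+1} is an
    assumption made here so the last map can also form sqrt(s_k * s_{k+1}).
    """
    step = (s_max - s_min) / (m - 1)
    return [s_min + step * k for k in range(m + 1)]


def default_boxes(f_k, s_k, s_next, aspect_ratios):
    """Return normalized (cx, cy, w, h) boxes for one f_k x f_k feature map."""
    # One shape per aspect ratio: w = s_k * sqrt(a), h = s_k / sqrt(a)
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box for aspect ratio 1 with scale sqrt(s_k * s_{k+1})
    s_prime = math.sqrt(s_k * s_next)
    shapes.append((s_prime, s_prime))
    boxes = []
    for i in range(f_k):
        for j in range(f_k):
            cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k
            boxes.extend((cx, cy, w, h) for w, h in shapes)
    return boxes


s = scales(6)
# Second map (k = 2) of a six-map detector, with all five aspect ratios
boxes = default_boxes(19, s[1], s[2], [1, 2, 3, 1 / 2, 1 / 3])
print(len(boxes))
```

Each aspect ratio keeps the box area at $s_k^2$, since $w \cdot h = s_k \sqrt{a_r} \cdot s_k / \sqrt{a_r}$.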

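Matching proceeds in two steps. Each ground-truth box is first assigned to the default box with which it has the highest Jaccard overlap (IoU), so every ground truth has at least one positive. Any other default box whose IoU with some ground-truth box is $\geq 0.5$ is then also treated as a positive; the remaining default boxes are negatives (Liu et al., 2015).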
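A minimal NumPy sketch of this matching rule follows. It assumes boxes are given in corner form (xmin, ymin, xmax, ymax) with positive area, so center-form default boxes would need converting first. The names `iou_matrix` and `match` are illustrative, and the sketch does not resolve the case where two ground truths share the same best default box (the later one simply wins).

```python
import numpy as np


def iou_matrix(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4) in corner form."""
    lt = np.maximum(a[:, None, :2], b[None, :, :2])   # top-left of overlap
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])   # bottom-right of overlap
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)


def match(defaults, gts, threshold=0.5):
    """Per default box, the index of its matched ground truth, or -1."""
    iou = iou_matrix(defaults, gts)                    # (num_defaults, num_gts)
    # Threshold rule: a default is positive if its best IoU reaches the threshold
    matched = np.where(iou.max(axis=1) >= threshold, iou.argmax(axis=1), -1)
    # Best-overlap rule: each ground truth claims its best default regardless
    matched[iou.argmax(axis=0)] = np.arange(gts.shape[0])
    return matched


defaults = np.array([[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]])
gts = np.array([[0.0, 0.0, 0.4, 0.4], [0.6, 0.6, 0.7, 0.7]])
print(match(defaults, gts))
```

In this example the second ground truth is small and overlaps no default box by 0.5, yet it still receives a positive through the best-overlap rule.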

3. Prediction, Decoding, and Training Targets

For each default box $i$ and class $p$, the detection head predicts:

  • A confidence score $c_i^p$
  • Box regression offsets $t_i = (t_{cx}, t_{cy}, t_w, t_h)$

The predicted offsets are decoded to box coordinates relative to the default box $d_i = (d_i^{cx}, d_i^{cy}, d_i^w, d_i^h)$ as:

$$\hat{y}_i^{cx} = d_i^w \cdot t_{cx} + d_i^{cx}$$

$$\hat{y}_i^{cy} = d_i^h \cdot t_{cy} + d_i^{cy}$$

$$\hat{y}_i^w = d_i^w \cdot \exp(t_w)$$

$$\hat{y}_i^h = d_i^h \cdot \exp(t_h)$$

During training, the target offsets for a ground-truth box $g_j$ matched to default box $d_i$ are:

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^w}$$

$$\hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^h}$$

$$\hat{g}_j^w = \log\left(\frac{g_j^w}{d_i^w}\right)$$

$$\hat{g}_j^h = \log\left(\frac{g_j^h}{d_i^h}\right)$$
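Encoding and decoding are inverses of each other, which a short NumPy sketch makes easy to check. The function names `encode` and `decode` are illustrative; boxes are assumed to be arrays of shape (N, 4) in (cx, cy, w, h) form with positive widths and heights. Some implementations additionally scale the targets by fixed variance constants; that is left out here because the formulas above do not include it.

```python
import numpy as np


def encode(g, d):
    """Regression targets for ground-truth boxes g relative to defaults d."""
    t_xy = (g[:, :2] - d[:, :2]) / d[:, 2:]   # center shift in units of default size
    t_wh = np.log(g[:, 2:] / d[:, 2:])        # log ratio of sizes
    return np.concatenate([t_xy, t_wh], axis=1)


def decode(t, d):
    """Box coordinates recovered from offsets t and defaults d."""
    xy = d[:, 2:] * t[:, :2] + d[:, :2]
    wh = d[:, 2:] * np.exp(t[:, 2:])
    return np.concatenate([xy, wh], axis=1)


d = np.array([[0.5, 0.5, 0.2, 0.4]])
g = np.array([[0.55, 0.45, 0.3, 0.3]])
t = encode(g, d)
print(t)
print(decode(t, d))  # recovers g up to floating-point error
```

A ground-truth box identical to its default box encodes to all zeros, so a head that outputs zero offsets simply reproduces the anchor.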

4. Multi-Scale and Receptive Field Strategy

By employing multiple prediction heads attached to feature maps of descending spatial resolution, SSD achieves dense coverage of object scales:

  • Early, high-resolution maps (e.g., conv4_3 at 38×38) detect small objects.
  • Intermediate maps (e.g., conv7 at 19×19) specialize in mid-sized objects.
  • Low-resolution maps (e.g., conv11_2 at 1×1) cover very large objects.

Because all scales are handled in the same forward pass, this design runs in real time while reaching accuracy that Liu et al. (2015) report as comparable to two-stage detectors.

5. Loss Function and Hard Negative Mining

Training the detection head involves the composite MultiBox loss:

$$L(x, c, l, g) = \frac{1}{N}\left[L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right]$$

Where $N$ is the number of positive (matched) default boxes and $\alpha = 1$. The components comprise:

  • Confidence loss ($L_{conf}$): Softmax loss over classes, combining positives and hard-mined negatives.
  • Localization loss ($L_{loc}$): Smooth L1 loss over bounding box regression parameters.
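The two components can be combined in a few lines of NumPy. This is a sketch for a single image under stated assumptions: class 0 is background, `neg_mask` is a hypothetical boolean mask marking which negatives contribute to the confidence loss, and returning 0 when there are no positives is a choice made here to avoid dividing by zero. All function and parameter names are illustrative.

```python
import numpy as np


def smooth_l1(x):
    """Elementwise Smooth L1: quadratic near zero, linear elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)


def softmax_cross_entropy(logits, labels):
    """Per-box softmax loss; logits (N, C), integer labels (N,)."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(labels.shape[0]), labels]


def multibox_loss(logits, labels, loc_pred, loc_target, neg_mask, alpha=1.0):
    """(L_conf + alpha * L_loc) / N, with N the number of positive boxes."""
    pos = labels > 0                     # assumption: label 0 is background
    n = int(pos.sum())
    if n == 0:
        return 0.0
    conf = softmax_cross_entropy(logits, labels)
    l_conf = conf[pos].sum() + conf[neg_mask].sum()
    # Localization loss is computed on positives only
    l_loc = smooth_l1(loc_pred[pos] - loc_target[pos]).sum()
    return float((l_conf + alpha * l_loc) / n)


logits = np.zeros((2, 3))
labels = np.array([1, 0])
loss = multibox_loss(logits, labels, np.zeros((2, 4)), np.zeros((2, 4)),
                     neg_mask=np.array([False, True]))
print(loss)
```

Normalizing by the number of positives rather than by the number of boxes keeps the loss scale stable even though most default boxes are background.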

Hard negative mining keeps at most a 3:1 ratio of negatives to positives by sorting negative predictions by confidence loss and selecting the highest-loss examples (Liu et al., 2015).
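The selection step can be sketched as a sort over per-box confidence losses. The function name `hard_negative_mask` and its parameters are illustrative, and the sketch treats one image at a time; it assumes the per-box losses have already been computed.

```python
import numpy as np


def hard_negative_mask(conf_loss, pos_mask, ratio=3):
    """Boolean mask of the highest-loss negatives, at most ratio * num_positives."""
    num_pos = int(pos_mask.sum())
    num_neg = min(ratio * num_pos, int((~pos_mask).sum()))
    # Positives get -inf so they can never be picked as negatives
    neg_loss = np.where(pos_mask, -np.inf, conf_loss)
    order = np.argsort(-neg_loss)          # highest confidence loss first
    mask = np.zeros(pos_mask.shape, dtype=bool)
    mask[order[:num_neg]] = True
    return mask


conf_loss = np.array([0.1, 2.0, 0.5, 3.0, 0.2, 0.05])
pos_mask = np.array([True, False, False, False, False, False])
print(hard_negative_mask(conf_loss, pos_mask))
```

With one positive and a 3:1 ratio, the three negatives with the largest losses (indices 3, 1, and 2) are kept and the two easy negatives are dropped.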

6. Inference and Post-Processing

At inference, the SSD detection head yields on the order of $10^4$ candidate boxes with confidences (8732 for SSD300). The pipeline involves:

  • Thresholding low-confidence boxes (score < 0.01).
  • Non-Maximum Suppression per class (IoU threshold 0.45).
  • Keeping the top 200 detections per image.
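The three steps above can be sketched with a plain greedy NMS in NumPy. This is an illustration under stated assumptions, not the reference pipeline: boxes are already decoded into corner form (xmin, ymin, xmax, ymax) with positive area, column 0 of `scores` is background, and the names `iou_one_to_many`, `nms`, and `postprocess` are hypothetical.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box (4,) and many boxes (M, 4), all in corner form."""
    lt = np.maximum(box[:2], boxes[:, :2])
    rb = np.minimum(box[2:], boxes[:, 2:])
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[:, 0] * wh[:, 1]
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= iou_threshold]
    return keep


def postprocess(boxes, scores, score_threshold=0.01, iou_threshold=0.45, top_k=200):
    """Return up to top_k (class_id, score, box_index) tuples, best score first."""
    detections = []
    for c in range(1, scores.shape[1]):               # skip background column 0
        idx = np.flatnonzero(scores[:, c] >= score_threshold)
        for k in nms(boxes[idx], scores[idx, c], iou_threshold):
            detections.append((c, float(scores[idx[k], c]), int(idx[k])))
    detections.sort(key=lambda det: det[1], reverse=True)
    return detections[:top_k]


boxes = np.array([[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.9], [2.0, 2.0, 3.0, 3.0]])
scores = np.array([[0.1, 0.9], [0.2, 0.8], [0.995, 0.005]])
print(postprocess(boxes, scores))
```

Applying the score threshold before NMS matters for speed: the pairwise overlap work in NMS then runs on far fewer boxes than the head produces.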

For SSD300, this protocol achieves 72.1% mAP at 58 FPS on VOC2007 test images (300×300 input, Titan X GPU), outperforming comparably efficient single-stage detectors and matching two-stage detector accuracy (Liu et al., 2015).

7. Variants and Optimizations

The SSD detection head paradigm has been adapted and optimized in various contexts:

Tiny SSD (Wong et al., 2018):

  • Employs non-uniform Fire modules (from SqueezeNet) as the base, with aggressively pruned channel counts in auxiliary feature layers.
  • Retains the SSD detection head structure but applies it to bespoke feature maps (Fire modules and small convs), matching the SSD-300 loss and box formulations.
  • Uses half-precision FP16 weights and activations, all-$3 \times 3$ convs in detection heads, and yields a compact model (2.3 MB, 61.3% VOC2007 mAP).

MHD-Net (Shi et al., 2022):

  • Proposes a matching strategy between detection-head receptive fields and object-size distribution.
  • Demonstrates that two carefully selected heads (rather than six) can achieve close to maximal accuracy for many traffic datasets, with significantly smaller model size, fewer FLOPs, and higher speed.
  • Introduces a lightweight dilated-convolution module to expand receptive fields on shallow features, further boosting detection accuracy at marginal computational cost.

PDM-SSD (3D SSD) (Liang et al., 10 Feb 2025):

  • Extends the SSD detection head concept to LiDAR point clouds, combining a scene heatmap branch (grid-based) and a VoteNet-style point-based branch with calibration via channel attention.
  • Achieves per-class grid heatmaps, vote-based box regression, and hybrid grid-point feature fusion for both classification and regression, demonstrating higher efficiency for 3D single-stage detection tasks.

These adaptations illustrate the modularity of the SSD detection head, its versatility across architectures and modalities, and the continuous search for optimizations in parameter efficiency and computational performance.

References:

(Liu et al., 2015, Wong et al., 2018, Shi et al., 2022, Liang et al., 10 Feb 2025)
