SSD Detection Head Architecture
- SSD Detection Head is a fully convolutional module that predicts object class confidences and bounding box offsets on multi-scale feature maps.
- It uses predefined default boxes with specific scales and aspect ratios to ensure dense coverage for detecting objects of various sizes.
- Optimizations such as hard negative mining and specialized variants (Tiny SSD, MHD-Net, PDM-SSD) enhance detection efficiency and accuracy across different modalities.
A Single Shot MultiBox Detector (SSD) detection head is a fully convolutional module that simultaneously predicts class confidences and regresses bounding box coordinates for a fixed set of default boxes (anchor boxes) at each spatial location across multiple feature maps of varying resolution. Unlike two-stage detectors that decouple candidate region proposal and classification, the SSD detection head integrates all computations in a single unified forward pass, enabling real-time object detection with high accuracy across diverse object scales (Liu et al., 2015).
1. Architectural Overview of the SSD Detection Head
The detection head in SSD operates on top of a truncated base convolutional network (VGG-16 in the standard SSD300), replacing the dense classification layers (fc6, fc7) with convolutional counterparts to preserve spatial dimensions and facilitate dense predictions. Prediction heads are attached to six progressively downsampled feature maps for SSD300:
| Feature Map | Size | Depth | #Default Boxes/Cell (k) |
|---|---|---|---|
| conv4_3 | 38×38 | 512 | 4 |
| conv7 | 19×19 | 1024 | 6 |
| conv8_2 | 10×10 | 512 | 6 |
| conv9_2 | 5×5 | 256 | 6 |
| conv10_2 | 3×3 | 256 | 4 |
| conv11_2 | 1×1 | 256 | 4 |
On each spatial cell of every feature map, the detection head applies small (3×3) convolutional filters to predict, for every default box: (a) class scores (including background) and (b) 4 offsets parameterizing the bounding box relative to the anchor (Liu et al., 2015). In total, SSD300 evaluates 8732 default boxes per image.
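The per-head output sizes implied by the table above can be sketched in a few lines of Python. This is an illustrative count only, not the reference implementation; the per-map box counts follow the SSD300 configuration (Liu et al., 2015).

```python
# Sketch: per-cell output sizes of the SSD300 prediction heads.
# Each 3x3 conv head emits k * (C + 4) values per spatial cell:
# k default boxes times C class scores (incl. background) plus 4 offsets.
NUM_CLASSES = 21  # VOC: 20 object classes + background

# (feature map, spatial size, default boxes per cell k)
feature_maps = [
    ("conv4_3", 38, 4), ("conv7", 19, 6), ("conv8_2", 10, 6),
    ("conv9_2", 5, 6), ("conv10_2", 3, 4), ("conv11_2", 1, 4),
]

total_boxes = 0
for name, size, k in feature_maps:
    per_cell = k * (NUM_CLASSES + 4)  # outputs per spatial cell
    total_boxes += size * size * k
    print(f"{name}: {per_cell} outputs/cell")

print(total_boxes)  # 8732 default boxes for SSD300
```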
2. Default Boxes: Parameterization and Assignment
Default boxes (anchors) are defined per feature-map cell, parameterized by scale and aspect ratio. The scale $s_k$ for the $k$-th of $m$ feature maps is linearly interpolated between $s_{\min}$ and $s_{\max}$ (0.2 and 0.9 in the paper):

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

For each aspect ratio $a_r \in \{1, 2, 3, \tfrac{1}{2}, \tfrac{1}{3}\}$, the normalized width and height are set as:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

An additional scale-companion box is added for $a_r = 1$, with scale $s_k' = \sqrt{s_k s_{k+1}}$. The default box center at location $(i, j)$ on a map of spatial dimension $|f_k|$ is:

$$\left(\frac{i + 0.5}{|f_k|},\ \frac{j + 0.5}{|f_k|}\right), \qquad i, j \in [0, |f_k|)$$
Matching proceeds in two steps: each ground-truth box is first assigned to the default box with the highest Jaccard overlap (IoU), and then any default box with Jaccard overlap above 0.5 with some ground truth is also marked positive (Liu et al., 2015).
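The scale, aspect-ratio, and center formulas above can be turned into a minimal default-box generator. This is a sketch under the paper's parameterization ($s_{\min}=0.2$, $s_{\max}=0.9$, normalized coordinates); the function name and the reduced aspect-ratio set are illustrative choices, not part of the original.

```python
import math

def default_boxes(feature_size, k, m, s_min=0.2, s_max=0.9,
                  aspect_ratios=(1.0, 2.0, 0.5)):
    """Boxes (cx, cy, w, h) for the k-th of m feature maps (k is 1-based)."""
    step = (s_max - s_min) / (m - 1)
    s_k = s_min + step * (k - 1)        # scale of this map
    s_k1 = s_min + step * k             # scale of the next map
    boxes = []
    for i in range(feature_size):
        for j in range(feature_size):
            cx = (j + 0.5) / feature_size
            cy = (i + 0.5) / feature_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, s_k * math.sqrt(ar),
                              s_k / math.sqrt(ar)))
            # extra box for aspect ratio 1 with scale sqrt(s_k * s_{k+1})
            s_prime = math.sqrt(s_k * s_k1)
            boxes.append((cx, cy, s_prime, s_prime))
    return boxes

boxes = default_boxes(feature_size=3, k=5, m=6)
print(len(boxes))  # 3*3 cells x 4 boxes per cell = 36
```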
3. Prediction, Decoding, and Training Targets
For each default box $d_i$ and class $p$, the detection head predicts:
- A confidence score $c_i^p$
- Box regression offsets $l_i = (l_i^{cx}, l_i^{cy}, l_i^{w}, l_i^{h})$
The predicted offsets are decoded to box coordinates $(b^{cx}, b^{cy}, b^{w}, b^{h})$ as:

$$b^{cx} = d^{cx} + l^{cx} d^{w}, \quad b^{cy} = d^{cy} + l^{cy} d^{h}, \quad b^{w} = d^{w} \exp(l^{w}), \quad b^{h} = d^{h} \exp(l^{h})$$

During training, the target offsets $\hat{g}$ for a matched ground-truth box $g$ are:

$$\hat{g}^{cx} = \frac{g^{cx} - d^{cx}}{d^{w}}, \quad \hat{g}^{cy} = \frac{g^{cy} - d^{cy}}{d^{h}}, \quad \hat{g}^{w} = \log\frac{g^{w}}{d^{w}}, \quad \hat{g}^{h} = \log\frac{g^{h}}{d^{h}}$$
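The encode/decode pair above is exactly inverse, which a short sketch can demonstrate. Boxes are in (cx, cy, w, h) form; note that many practical SSD implementations additionally divide the targets by fixed "variance" constants, which the paper's formulation (and this sketch) omits.

```python
import math

def encode(g, d):
    """Training target g_hat for ground-truth box g relative to default box d."""
    gcx, gcy, gw, gh = g
    dcx, dcy, dw, dh = d
    return ((gcx - dcx) / dw, (gcy - dcy) / dh,
            math.log(gw / dw), math.log(gh / dh))

def decode(l, d):
    """Inverse: predicted offsets l back to absolute box coordinates."""
    lcx, lcy, lw, lh = l
    dcx, dcy, dw, dh = d
    return (dcx + lcx * dw, dcy + lcy * dh,
            dw * math.exp(lw), dh * math.exp(lh))

d = (0.5, 0.5, 0.2, 0.2)          # default box
g = (0.55, 0.45, 0.25, 0.15)      # matched ground truth
roundtrip = decode(encode(g, d), d)
assert all(abs(a - b) < 1e-9 for a, b in zip(roundtrip, g))
```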
4. Multi-Scale and Receptive Field Strategy
By employing multiple prediction heads attached to feature maps of descending spatial resolution, SSD achieves dense coverage of object scales:
- Early, high-resolution maps (e.g., conv4_3 at 38×38) detect small objects.
- Intermediate maps (e.g., conv7 at 19×19) specialize in mid-sized objects.
- Low-resolution maps (e.g., conv11_2 at 1×1) cover very large objects.
This architecture supports real-time performance with accuracy matching or exceeding two-stage detectors, through seamless multi-scale feature integration (Liu et al., 2015).
5. Loss Function and Hard Negative Mining
Training the detection head involves the composite MultiBox loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

Where $N$ is the number of positive (matched) default boxes (the loss is set to 0 if $N = 0$) and $\alpha = 1$ by cross-validation. The components comprise:
- Confidence loss (): Softmax loss over classes, combining positives and hard-mined negatives.
- Localization loss (): Smooth L1 loss over bounding box regression parameters.
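The smooth L1 function used in the localization term is quadratic near zero and linear beyond, which keeps gradients bounded for badly mispredicted boxes. A minimal sketch:

```python
# Smooth L1 (Huber-style) loss on a single regression residual x:
# quadratic for |x| < 1, linear (slope 1) otherwise.
def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

print(smooth_l1(0.5))   # 0.125 (quadratic region)
print(smooth_l1(2.0))   # 1.5   (linear region)
```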
Hard negative mining keeps at most a 3:1 ratio of negatives to positives by sorting negative predictions by confidence loss and selecting the highest-loss examples (Liu et al., 2015).
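Hard negative mining as described above can be sketched as a simple sort-and-truncate over unmatched boxes. The function name and flat list representation are illustrative; real implementations operate on batched tensors.

```python
# Sketch of hard negative mining: keep all positives, plus the
# highest-confidence-loss negatives up to a 3:1 negative:positive ratio.
def hard_negative_mine(conf_losses, is_positive, neg_pos_ratio=3):
    """conf_losses: per-box confidence loss; is_positive: matched flags.
    Returns indices of boxes contributing to the confidence loss."""
    pos = [i for i, p in enumerate(is_positive) if p]
    neg = [i for i, p in enumerate(is_positive) if not p]
    # sort negatives by descending confidence loss, keep the hardest ones
    neg.sort(key=lambda i: conf_losses[i], reverse=True)
    keep_neg = neg[: neg_pos_ratio * len(pos)]
    return pos + keep_neg

losses = [0.1, 2.0, 0.05, 1.5, 0.3, 0.8]
positive = [True, False, False, False, False, False]
kept = hard_negative_mine(losses, positive)
print(kept)  # [0, 1, 3, 5]: 1 positive + the 3 hardest negatives
```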
6. Inference and Post-Processing
At inference, the SSD detection head yields candidate boxes with confidences. The pipeline involves:
- Thresholding low-confidence boxes (score < 0.01).
- Non-Maximum Suppression per class (IoU threshold 0.45).
- Keeping the top 200 detections per image.
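The three post-processing steps above can be sketched for a single class with greedy NMS. This is a didactic sketch (corner-format boxes, plain lists), not SSD's optimized implementation.

```python
# Per-class SSD post-processing: score thresholding, greedy NMS at
# IoU 0.45, then keep at most top_k detections.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.01, iou_thresh=0.45, top_k=200):
    # candidates above the score threshold, best first
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order and len(keep) < top_k:
        i = order.pop(0)
        keep.append(i)
        # suppress remaining boxes that overlap the kept box too much
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box is suppressed
```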
For SSD300, this protocol achieves 72.1% mAP at 58 FPS on VOC2007 test images (300×300 input, Titan X GPU), outperforming comparably efficient single-stage detectors and matching two-stage detector accuracy (Liu et al., 2015).
7. Variants and Optimizations
The SSD detection head paradigm has been adapted and optimized in various contexts:
Tiny SSD (Wong et al., 2018):
- Employs non-uniform Fire modules (from SqueezeNet) as the base, with aggressively pruned channel counts in auxiliary feature layers.
- Retains the SSD detection head structure but applies it to bespoke feature maps (Fire modules and small convs), matching the SSD-300 loss and box formulations.
- Uses half-precision (FP16) weights and activations and all-1×1 convolutions in the detection heads, yielding a compact model (2.3 MB, 61.3% VOC2007 mAP).
MHD-Net (Shi et al., 2022):
- Proposes a matching strategy between detection-head receptive fields and object-size distribution.
- Demonstrates that two carefully selected heads (rather than six) can achieve close to maximal accuracy for many traffic datasets, with significant reductions in model size, FLOPs, and improved speed.
- Introduces a lightweight dilated-convolution module to expand receptive fields on shallow features, further boosting detection accuracy at marginal computational cost.
PDM-SSD (3D SSD) (Liang et al., 2025):
- Extends the SSD detection head concept to LiDAR point clouds, combining a scene heatmap branch (grid-based) and a VoteNet-style point-based branch with calibration via channel attention.
- Achieves per-class grid heatmaps, vote-based box regression, and hybrid grid-point feature fusion for both classification and regression, demonstrating higher efficiency for 3D single-stage detection tasks.
These adaptations illustrate the modularity of the SSD detection head, its versatility across architectures and modalities, and the continuous search for optimizations in parameter efficiency and computational performance.
References:
(Liu et al., 2015; Wong et al., 2018; Shi et al., 2022; Liang et al., 2025)