Default Boxes & Aspect Ratios in SSD

Updated 19 November 2025

Default boxes and aspect ratios are predefined bounding boxes in SSD that discretize the output space for dense, multi-scale object detection.
Aspect ratios are chosen to span diverse object geometries, with an extra scale for ar=1 to interpolate between feature map resolutions and improve matching.
Adaptive methods adjust box aspect ratios based on dataset statistics, leading to measurable mAP improvements and enhanced localization precision.

Default boxes and aspect ratios constitute the core discretization mechanism underpinning single-shot object detectors such as SSD (Single Shot MultiBox Detector). Default boxes, also referred to as anchor boxes in some literature, are predefined bounding boxes of various scales and aspect ratios densely tiled across multiple locations and feature maps in the network. Through a combination of multi-scale and multi-aspect-ratio coverage, these boxes supply the geometric priors required for the network to localize object candidates of different shapes and sizes, enabling efficient dense prediction in a single feed-forward pass (Liu et al., 2015). Advances in adaptive box generation further improve localization fidelity by tuning these ratios to dataset statistics (Thakar et al., 2018).

1. Construction and Parameterization of Default Boxes in SSD

In SSD, the output space of bounding boxes is discretized into a set of default boxes at each spatial location of several feature maps that span different resolutions (Liu et al., 2015). Each default box is defined by:

A center position (determined by the feature map grid cell).
A scale $s_k$ specific to the $k$-th feature map, computed as

$s_k = s_{min} + (s_{max} - s_{min}) \cdot \frac{k-1}{m-1}$

where $s_{min} = 0.2$ and $s_{max} = 0.9$, and $m$ is the number of feature maps.

An aspect ratio $ar \in AR$, with $AR = \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$.
Width and height, normalized by image size, given by $w_{k,ar} = s_k \cdot \sqrt{ar}$ and $h_{k,ar} = s_k / \sqrt{ar}$.

At each spatial location, one default box is created for each $ar \in AR$, plus an additional box for $ar=1$ at scale $s'_k = \sqrt{s_k \cdot s_{k+1}}$ to interpolate between the scales of adjacent feature maps. This gives $|AR| + 1 = 6$ default boxes per location, except in layers that use only four, where the aspect ratios $3$ and $\frac{1}{3}$ are omitted.
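The scale and shape formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names `ssd_scales` and `default_box_shapes` are hypothetical, and the value used for $s_{m+1}$ on the last feature map (extrapolating the linear formula by one step) is an assumption, since the text does not define it.

```python
import math

ASPECT_RATIOS = (1, 2, 3, 1 / 2, 1 / 3)

def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    """Return [s_1, ..., s_m, s_{m+1}] for m >= 2 feature maps.

    Assumption: s_{m+1} extrapolates the same linear formula one step past m,
    so that the extra ar = 1 box is also defined on the last feature map.
    """
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 2)]

def default_box_shapes(k, scales, aspect_ratios=ASPECT_RATIOS):
    """Normalized (w, h) pairs for the k-th feature map (k is 1-indexed)."""
    s_k, s_next = scales[k - 1], scales[k]
    # One box per aspect ratio: w = s_k * sqrt(ar), h = s_k / sqrt(ar).
    shapes = [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in aspect_ratios]
    # Extra square box at the geometric mean of adjacent scales.
    s_prime = math.sqrt(s_k * s_next)
    shapes.append((s_prime, s_prime))
    return shapes

scales = ssd_scales()
shapes = default_box_shapes(1, scales)
```

Every aspect-ratio box on a given map has area $s_k^2$, so changing $ar$ redistributes width and height without changing the box area.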

2. Aspect Ratios: Selection and Significance

The aspect ratio set $AR = \{1, 2, 3, 1/2, 1/3\}$ in the original SSD is chosen to span a broad range of object geometries, from tall, narrow objects ($ar < 1$) to wide ones ($ar > 1$). Empirically, these choices provide effective coverage for common datasets such as VOC and COCO, facilitating the matching of default boxes to ground-truth objects of diverse shapes (Liu et al., 2015). The additional $ar=1$ box at each location adds a scale between those of adjacent feature maps, improving detection of objects whose sizes would otherwise fall between the discretized scales.

3. Training: Default Box Matching, Supervision, and Loss

During training, ground-truth boxes are assigned to default boxes based on Jaccard (IoU) overlap. The procedure involves:

Each ground-truth box $g_j$ is matched to the default box $d_i$ with maximum IoU.
All default boxes with IoU exceeding 0.5 with any $g_j$ are additionally matched.
Matched boxes are labeled as positives; others as negatives.
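The matching steps above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: boxes are in `(xmin, ymin, xmax, ymax)` format with positive area, the function names are hypothetical, and when two ground-truth boxes share the same best default box the later one wins, a simplification that a full implementation would need to handle.

```python
import numpy as np

def iou_matrix(gt, defaults):
    """Pairwise IoU. gt: (G, 4), defaults: (D, 4), both (xmin, ymin, xmax, ymax)."""
    lt = np.maximum(gt[:, None, :2], defaults[None, :, :2])  # intersection top-left
    rb = np.minimum(gt[:, None, 2:], defaults[None, :, 2:])  # intersection bottom-right
    wh = np.clip(rb - lt, 0.0, None)                         # zero if no overlap
    inter = wh[..., 0] * wh[..., 1]
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    area_d = (defaults[:, 2] - defaults[:, 0]) * (defaults[:, 3] - defaults[:, 1])
    return inter / (area_g[:, None] + area_d[None, :] - inter)

def match_default_boxes(gt, defaults, threshold=0.5):
    """Per default box: index of its matched ground-truth box, or -1 for a negative."""
    iou = iou_matrix(gt, defaults)
    best_gt = iou.argmax(axis=0)
    # Threshold rule: any default box with IoU > threshold is a positive.
    matches = np.where(iou.max(axis=0) > threshold, best_gt, -1)
    # Best-match rule: every ground-truth box keeps its highest-IoU default box.
    matches[iou.argmax(axis=1)] = np.arange(len(gt))
    return matches
```

The best-match rule guarantees that each ground-truth box has at least one positive even when no default box reaches the 0.5 threshold.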

A multi-task loss is computed per image:

$L(x, c, l, g) = \frac{1}{N} \big[ L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \big]$

where $N$ is the number of matched (positive) default boxes, $L_{conf}$ is the softmax loss over all classes, $L_{loc}$ is the Smooth L1 loss on box offsets, and $\alpha=1$. Hard negative mining keeps the negative-to-positive ratio at no more than 3:1 by sorting the negatives by confidence loss and retaining only those with the highest loss (Liu et al., 2015).
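A minimal NumPy sketch of this loss for a single image is given below. It assumes that class index 0 denotes background, that `labels` already holds the result of the matching step, and that `loc_target` holds pre-encoded box offsets. The function names are hypothetical and the box-offset encoding itself is not shown.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def ssd_loss(logits, loc_pred, labels, loc_target, alpha=1.0, neg_pos_ratio=3):
    """logits: (D, C); loc_pred, loc_target: (D, 4); labels: (D,) ints, 0 = background."""
    # Per-box softmax cross-entropy, computed in a numerically stable way.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    conf = -log_probs[np.arange(len(labels)), labels]

    pos = labels > 0
    n_pos = int(pos.sum())
    if n_pos == 0:
        return 0.0  # no matched boxes: the loss is defined as 0

    # Hard negative mining: keep the negatives with the highest confidence loss,
    # at most neg_pos_ratio negatives per positive.
    neg_loss = np.where(pos, -np.inf, conf)
    n_neg = min(neg_pos_ratio * n_pos, int((~pos).sum()))
    hard_neg = np.argsort(-neg_loss)[:n_neg]

    l_conf = conf[pos].sum() + conf[hard_neg].sum()
    l_loc = smooth_l1(loc_pred[pos] - loc_target[pos]).sum()
    return (l_conf + alpha * l_loc) / n_pos
```

Only positives contribute to the localization term, while the confidence term sums over the positives and the mined negatives.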

4. Multi-Scale and Multi-Aspect Coverage Across Feature Maps

Default boxes are distributed across several feature maps of decreasing spatial resolution. For a $300 \times 300$ input, the feature map sizes are conv4_3: $38 \times 38$, conv7: $19 \times 19$, conv8_2: $10 \times 10$, conv9_2: $5 \times 5$, conv10_2: $3 \times 3$, and conv11_2: $1 \times 1$. Smaller feature maps (large receptive field) are assigned larger $s_k$, targeting large objects, while finer maps focus on small objects. At inference, each default box is scored for all object classes, box offsets are regressed, and post-processing (confidence thresholding and non-maximum suppression) yields detections (Liu et al., 2015).

Feature Map	Resolution ($f_k \times f_k$)	Box Scale ($s_k$)	Boxes per Location
conv4_3	$38\times38$	Small ($\approx 0.2$)	4
conv11_2	$1\times1$	Large ($\approx 0.9$)	4
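To make the tiling concrete, the sketch below places box centers on each feature-map grid and counts the default boxes for a $300 \times 300$ input. Two details are assumptions taken from the SSD paper rather than from the text above: the center formula $((i + 0.5)/f_k, (j + 0.5)/f_k)$ and the per-layer box counts (four on conv4_3, conv10_2, and conv11_2, six on the other layers). The names are hypothetical.

```python
import numpy as np

# Feature-map sizes for a 300x300 input (conv4_3 ... conv11_2).
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]
# Assumption: per-layer box counts as reported in the SSD paper.
BOXES_PER_LOCATION = [4, 6, 6, 6, 4, 4]

def box_centers(f_k):
    """Normalized (cx, cy) centers of an f_k x f_k grid, one per cell."""
    coords = (np.arange(f_k) + 0.5) / f_k
    cx, cy = np.meshgrid(coords, coords)
    return np.stack([cx.ravel(), cy.ravel()], axis=1)

total_boxes = sum(
    len(box_centers(f_k)) * n
    for f_k, n in zip(FEATURE_MAP_SIZES, BOXES_PER_LOCATION)
)
print(total_boxes)  # 8732
```

Under these assumptions most of the boxes (5776 of 8732) come from the $38 \times 38$ map.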

5. Adaptive Default Box Selection: Data-Driven Aspect Ratios

The adaptive approach estimates the empirical distribution of ground-truth box aspect ratios for each dataset and selects $K=5$ representative aspect ratios: the mode, mean, median, and 25th and 75th percentiles, computed from the histogram estimator $\hat{f}_X(x)$. The median and percentiles are read from the empirical CDF $\hat{F}_X(x)$, and the resulting data-driven set $\{r_1,\dots,r_5\}$ replaces the fixed SSD set (Thakar et al., 2018). Box construction is otherwise unchanged from standard SSD, including interpolation for $ar\approx1$ as needed. On a construction equipment dataset, this method improved mAP from 0.51 to 0.54 (+0.03 absolute), with particular gains (+3–4%) for small or slender object classes. This suggests that adaptive aspect ratios can reduce localization mismatch for atypical object geometries.
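The statistic selection described above could be sketched as follows. This is not the authors' implementation: the function name `adaptive_aspect_ratios` and the bin count are hypothetical, the mode is taken as the center of the fullest histogram bin, and the percentiles are computed directly from the samples with `np.percentile` rather than from a binned CDF.

```python
import numpy as np

def adaptive_aspect_ratios(widths, heights, bins=50):
    """Five representative aspect ratios (w / h) from ground-truth box sizes."""
    ar = np.asarray(widths, dtype=float) / np.asarray(heights, dtype=float)
    # Mode: center of the histogram bin with the most samples.
    counts, edges = np.histogram(ar, bins=bins)
    i = counts.argmax()
    mode = 0.5 * (edges[i] + edges[i + 1])
    # Percentiles taken directly from the samples (a simplification).
    p25, median, p75 = np.percentile(ar, [25, 50, 75])
    return {
        "mode": float(mode),
        "mean": float(ar.mean()),
        "median": float(median),
        "p25": float(p25),
        "p75": float(p75),
    }
```

The five values would then stand in for $\{1, 2, 3, 1/2, 1/3\}$ when generating default boxes. Some of them can coincide on strongly peaked distributions, in which case fewer than five distinct ratios result.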

6. Limitations, Extensions, and Future Directions

The histogram estimator used for modeling the aspect ratio distribution is a low-resolution approximation; kernel density estimation or Gaussian mixture models could improve the selection of representative ratios. Fixing $K=5$ may be suboptimal if the distribution is highly multimodal; $K$ could instead be chosen adaptively via the Bayesian Information Criterion. Per-class ratio selection, rather than a single global set, may further improve performance for heterogeneous detection tasks. Integrating adaptive box selection into anchor-free or attention-based single-stage detectors could extend these benefits more broadly. A plausible implication is that data-driven box design is particularly beneficial for detection tasks with highly variable or atypical object shape statistics, consistent with the improvements reported on datasets of small and complex objects (Thakar et al., 2018).

7. Summary and Practical Considerations

Default boxes and their associated aspect ratios are central to the efficiency, accuracy, and generalization properties of SSD-style object detectors. The fixed design in SSD, grounded in coverage of typical aspect ratios and multi-scale tiling across feature maps, is empirically effective across standard datasets (Liu et al., 2015). Adaptive schemes further refine the geometric priors to more closely match dataset statistics, yielding measurable accuracy improvements with minimal computational overhead (Thakar et al., 2018). These developments underscore the importance of aligning discretization strategies with data characteristics to optimize dense object localization in single-shot detection frameworks.

References

Liu et al. (2015). SSD: Single Shot MultiBox Detector.

Thakar et al. (2018). Ensemble-based Adaptive Single-shot Multi-box Detector.