SSD: Single Shot MultiBox Detector (1512.02325v5)

Published 8 Dec 2015 in cs.CV

Abstract: We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. For $300\times 300$ input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for $500\times 500$ input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster R-CNN model. Code is available at https://github.com/weiliu89/caffe/tree/ssd .

Authors (7)
  1. Wei Liu (1135 papers)
  2. Dragomir Anguelov (73 papers)
  3. Dumitru Erhan (30 papers)
  4. Christian Szegedy (28 papers)
  5. Scott Reed (32 papers)
  6. Cheng-Yang Fu (15 papers)
  7. Alexander C. Berg (33 papers)
Citations (27,855)

Summary

The paper "SSD: Single Shot MultiBox Detector" by Liu et al. presents an object detection method that uses a single deep neural network. Unlike earlier pipelines, which first generate object proposals and then resample pixels or features for each proposal, SSD (Single Shot MultiBox Detector) predicts all boxes and class scores in a single stage.

Methodology and Contributions

The SSD model rests on three ideas:

  • Output Space Discretization: SSD discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios and scales at each location of several feature maps. At prediction time, the model scores each default box for every object category and predicts adjustments that fit the box more closely to the object's shape.
  • Multi-Scale Feature Maps: The network combines predictions from multiple feature maps with different resolutions, which lets it handle objects of various sizes.
  • Unified Architecture: SSD eliminates proposal generation and the subsequent resampling stage, keeping all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that need a detection component.
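The default-box idea above can be sketched in a few lines. The scale rule (linearly spaced from 0.2 to 0.9), the aspect ratios {1, 2, 3, 1/2, 1/3} and the extra square box follow the formulas given in the paper; the function names and the choice of `next_scale` for the last feature map are assumptions made for this illustration, and the released Caffe code may differ in detail.

```python
import math

def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (fraction of image size)."""
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]

def default_box_shapes(scale, next_scale,
                       aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    """(width, height) pairs for one feature map, relative to the image size."""
    # w = s * sqrt(a), h = s / sqrt(a), so every box has area s^2.
    shapes = [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box at the geometric mean of this scale and the next one.
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes

def default_boxes_for_map(fmap_size, shapes):
    """Centre every shape on every cell; boxes are (cx, cy, w, h) in [0, 1] units."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            boxes.extend((cx, cy, w, h) for w, h in shapes)
    return boxes

scales = default_box_scales(6)
shapes = default_box_shapes(scales[0], scales[1])
boxes = default_boxes_for_map(3, shapes)  # toy 3x3 feature map
```

With six shapes per cell, a toy 3 × 3 feature map yields 54 default boxes; the network then predicts class scores and four offsets for each of them.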

Implementation Details

Key elements of the SSD framework include:

  1. Multi-Scale Feature Maps: SSD adds several convolutional feature layers of progressively decreasing size to the end of a truncated base network. Each layer produces detections at its own scale, so the model covers a wide range of object sizes.
  2. Convolutional Predictors: For each feature map, a set of small (3 × 3) convolutional filters produces the category scores and the box offsets relative to the default boxes, which keeps prediction cheap.
  3. Default Boxes and Aspect Ratios: Each feature map cell is associated with a set of default boxes covering different aspect ratios and scales. This setup provides dense coverage of possible bounding box shapes.
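To make the size of the output concrete, the following sketch counts default boxes and predictor channels. The feature-map sides and boxes-per-cell values are the SSD300 configuration described in the paper (they are not stated in this summary), and 21 classes assumes PASCAL VOC's 20 categories plus background; the function names are hypothetical.

```python
# SSD300 layout from the paper: (feature map side, default boxes per cell).
SSD300_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def total_default_boxes(maps):
    """Number of default boxes summed over all feature maps."""
    return sum(side * side * k for side, k in maps)

def predictor_channels(boxes_per_cell, num_classes):
    """Output channels of a 3x3 predictor: k boxes, each with c scores + 4 offsets."""
    return boxes_per_cell * (num_classes + 4)

num_boxes = total_default_boxes(SSD300_MAPS)    # 8732
channels_k4 = predictor_channels(4, 21)         # 100
channels_k6 = predictor_channels(6, 21)         # 150
```

The total of 8732 boxes matches the per-class detection count the paper reports for SSD300; most of them (5776) come from the 38 × 38 map, which is responsible for the smallest objects.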

Training and Performance

Training SSD involves the following components:

  • Matching Strategy: Each ground-truth box is first matched to the default box with the highest Jaccard overlap; default boxes are then also matched to any ground-truth box whose Jaccard overlap exceeds a threshold (0.5). Allowing several default boxes per object simplifies the learning problem.
  • Loss Functions: The training objective is a weighted sum of a confidence loss (Softmax loss) over the class scores and a localization loss (Smooth L1 loss) over the bounding box coordinates.
  • Data Augmentation: Training images are augmented by randomly sampling image patches, which makes the model more robust to variation in object size and shape.
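The matching step and the Smooth L1 term can be sketched as follows. This is an illustration rather than the authors' code: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, all function names are hypothetical, and assigning a default box to its single best-overlapping ground truth when several exceed the threshold is an assumption of this sketch.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(defaults, truths, threshold=0.5):
    """Per default box: index of the matched ground truth, or -1 for background."""
    if not defaults:
        return []
    matches = []
    # Threshold rule: a default box is positive if its best overlap exceeds 0.5.
    for dbox in defaults:
        best_t, best_iou = -1, threshold
        for t, tbox in enumerate(truths):
            iou = jaccard(dbox, tbox)
            if iou > best_iou:
                best_t, best_iou = t, iou
        matches.append(best_t)
    # Best-overlap rule: every ground truth claims its best default box,
    # even if that overlap is below the threshold.
    for t, tbox in enumerate(truths):
        best_d = max(range(len(defaults)), key=lambda d: jaccard(defaults[d], tbox))
        matches[best_d] = t
    return matches

def smooth_l1(x):
    """Smooth L1 penalty applied to each box-offset residual."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

defaults = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0), (0.0, 0.0, 0.2, 0.2)]
matched = match_default_boxes(defaults, [(0.0, 0.0, 0.5, 0.6)])
forced = match_default_boxes(defaults, [(0.0, 0.0, 0.1, 0.1)])
```

In the first call only the first default box overlaps the object enough to be a positive. In the second, no overlap exceeds 0.5, yet the small object is still assigned to its best default box, so every ground truth contributes to the loss.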

Experimental Evaluation

SSD was evaluated on the PASCAL VOC, COCO, and ILSVRC datasets. Reported results include:

  • On the PASCAL VOC 2007 test set, SSD300 achieved 74.3% mAP at 59 FPS, outperforming Faster R-CNN (73.2% mAP at 7 FPS) and YOLO (63.4% mAP at 45 FPS).
  • On COCO, SSD512 reached an mAP@[0.5:0.95] of 26.8, compared with Faster R-CNN's 24.2, and maintained robust performance across various object scales.

Implications and Future Directions

The SSD framework has implications for both the practical deployment of object detection systems and ongoing research:

  • Efficiency and Speed: By consolidating detection into a single network, SSD reduces computational overhead, making real-time applications such as autonomous driving more feasible.
  • Evidence for Single-Stage Detection: The results show that a single-stage detector can reach accuracy comparable to multi-stage pipelines without their complexity.

One direction for future work is to use SSD as a component of larger systems, for example combining it with recurrent neural networks to detect and track objects in video simultaneously.

In conclusion, the SSD framework proposed by Liu et al. is an efficient and accurate approach to object detection. Its two main contributions, predictions from multi-scale feature maps and a detection pipeline reduced to a single network, make it well suited to real-time applications and a useful reference point for later work on object detection.