DSSD : Deconvolutional Single Shot Detector (1701.06659v1)

Published 23 Jan 2017 in cs.CV

Abstract: The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101[14]) with a fast detection framework (SSD[18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high-level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with $513 \times 513$ input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method R-FCN[3] on each dataset.

Authors (5)

Cheng-Yang Fu
Wei Liu
Ananth Ranga
Ambrish Tyagi
Alexander C. Berg

Summary

The paper combines a Residual-101 backbone with SSD and shows that the combination pays off only with additional learned modules; a naive swap does not succeed.
The authors add deconvolution layers that up-sample feature maps, bringing large-scale context to the detection of small objects.
With $513 \times 513$ input, DSSD reaches 81.5% mAP on VOC2007 test, 80.0% on VOC2012 test, and 33.2% on COCO, ahead of R-FCN on each dataset.

An Overview of Deconvolutional Single Shot Detector (DSSD)

The paper "DSSD: Deconvolutional Single Shot Detector" by Cheng-Yang Fu et al. extends the Single Shot MultiBox Detector (SSD) framework with additional context information to improve detection accuracy, particularly for small objects.

Core Contributions

Integration of Residual Networks with SSD

The first contribution is the combination of a state-of-the-art classifier, Residual-101, with the SSD detection framework. SSD originally used a VGG network as its backbone; the deeper Residual-101 is intended to improve feature representation and thus detection accuracy.

Introduction of Deconvolutional Layers

The second contribution is the addition of deconvolution layers to SSD, forming the Deconvolutional Single Shot Detector (DSSD). These layers bring large-scale context into the prediction, which particularly benefits small objects, and they give the network an hourglass-like encoder-decoder structure.

Technical Innovations

Prediction Module

To make the deeper Residual-101 network work within SSD, the authors introduce a new prediction module: a small learned sub-network placed between each feature map and its classification and box-regression outputs. The module is meant to ease the gradient problems that arise during training with very deep networks. Of the configurations tested, a single residual block with an additional feed-forward (skip) connection proved the most effective.
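A residual block of this kind can be sketched for a single spatial location, where a 1x1 convolution reduces to a matrix-vector product over channels. This is a minimal illustration only: the helper names, the two-layer branch, and the weights are assumptions, not the paper's layer configuration.

```python
def matvec(weights, x):
    """Apply a 1x1 convolution at one location: a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def prediction_block(x, w_branch1, w_branch2, w_skip):
    """Residual block: learned branch plus a learned skip path, summed.

    x is the channel vector at one feature-map location. All weight
    matrices are hypothetical and chosen only for illustration.
    """
    branch = matvec(w_branch2, relu(matvec(w_branch1, x)))
    skip = matvec(w_skip, x)
    return [b + s for b, s in zip(branch, skip)]


# Toy example with two channels.
features = [1.0, -2.0]
out = prediction_block(
    features,
    w_branch1=[[1.0, 0.0], [0.0, 1.0]],
    w_branch2=[[2.0, 0.0], [0.0, 2.0]],
    w_skip=[[1.0, 1.0], [0.0, 1.0]],
)
```

In a detector, the classification and box-regression layers would then read `out` rather than the raw feature vector.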

Deconvolution Module

The deconvolution module adds further context. It up-samples a deeper, lower-resolution feature map and combines it with the corresponding earlier-layer feature map by element-wise product. High-level semantic information is thereby injected back into the higher-resolution maps at each scale, which is crucial for detecting small objects.
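The up-sample-then-multiply step can be sketched on single-channel feature maps stored as nested lists. Note the substitution: nearest-neighbour up-sampling stands in here for the learned deconvolution, and the surrounding convolution and normalization layers are left out. The function names and the final ReLU are assumptions made for illustration.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2-D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_by_product(deep, shallow):
    """Up-sample the deep map, multiply element-wise with the shallow map, apply ReLU."""
    up = upsample_nearest_2x(deep)
    same_shape = len(up) == len(shallow) and all(
        len(a) == len(b) for a, b in zip(up, shallow)
    )
    if not same_shape:
        raise ValueError("shallow map must be exactly twice the size of the deep map")
    return [
        [max(0.0, u * s) for u, s in zip(up_row, sh_row)]
        for up_row, sh_row in zip(up, shallow)
    ]


# Toy example: a 1x2 deep map fused with a 2x4 shallow map.
deep = [[1.0, 2.0]]
shallow = [[0.5, -1.0, 2.0, 0.0],
           [1.0, 1.0, -3.0, 4.0]]
fused = fuse_by_product(deep, shallow)
```

Because the combination is a product rather than a sum, the deep map acts as a per-location gate on the shallow map's responses.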

Methodology and Experimental Setup

DSSD was evaluated on the PASCAL VOC and COCO datasets. For the VOC2007 test set, training used the union of the VOC2007 and VOC2012 trainval sets. For COCO, training used large batch sizes to keep the batch normalization layers stable. Training proceeded in stages: first the base SSD with the Residual-101 backbone, then the deconvolution components, and finally fine-tuning of the whole network.
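The staged schedule can be written down as a small table of which parameter groups receive updates in each stage. The group names and stage names below are hypothetical, and the sketch assumes the SSD weights are held fixed while the deconvolution components are trained, which the text above does not state explicitly.

```python
# Hypothetical parameter-group names, for illustration only.
PARAM_GROUPS = ("ssd_residual101", "deconv_side")

# One entry per stage: (stage name, groups that receive gradient updates).
SCHEDULE = [
    ("train_base_ssd", {"ssd_residual101"}),
    ("train_deconv_side", {"deconv_side"}),  # assumption: SSD weights frozen here
    ("finetune_all", set(PARAM_GROUPS)),
]


def frozen_groups(stage_name):
    """Return the sorted list of parameter groups held fixed in a stage."""
    for name, trainable in SCHEDULE:
        if name == stage_name:
            return sorted(set(PARAM_GROUPS) - trainable)
    raise KeyError(stage_name)


stage_order = [name for name, _ in SCHEDULE]
```

A training script would look up `frozen_groups(stage)` at the start of each stage and disable gradient updates for those groups.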

Performance and Results

With $513 \times 513$ input, DSSD improves mean Average Precision (mAP) over R-FCN on both PASCAL VOC and COCO:

PASCAL VOC2007 test: 81.5% mAP, compared with 80.5% for R-FCN.
PASCAL VOC2012 test: 80.0% mAP, with improvements especially on small-object categories.
COCO: 33.2% mAP, again ahead of R-FCN.

Implications and Future Directions

Practical Implications

DSSD is most useful in real-world scenarios where small-object detection is crucial, such as surveillance and autonomous driving. The added layers make it slower than plain SSD, so it suits applications that can trade some speed for accuracy.

Theoretical Implications

From a theoretical perspective, DSSD shows effective ways to integrate high-level semantic context into detection frameworks. The deconvolution layers and the prediction module offer insight into improving gradient flow and feature combination in deep networks.

Potential Future Developments

Future work could explore more efficient deconvolutional methods, potentially aiming for real-time constraints. Additionally, the principles of DSSD could be extended to other detection architectures, like the R-CNN series, further broadening its applicability.

Conclusion

The DSSD paper extends the SSD framework with deconvolution layers and a deeper backbone network. The approach improves accuracy, especially for small objects, at some cost in inference speed relative to SSD, making it relevant to both academic research and practical object detection.
