Enhancement of SSD by concatenating feature maps for object detection (1705.09587v1)

Published 26 May 2017 in cs.CV

Abstract: We propose an object detection method that improves the accuracy of the conventional SSD (Single Shot Multibox Detector), which is one of the top object detection algorithms in both aspects of accuracy and speed. The performance of a deep network is known to be improved as the number of feature maps increases. However, it is difficult to improve the performance by simply raising the number of feature maps. In this paper, we propose and analyze how to use feature maps effectively to improve the performance of the conventional SSD. The enhanced performance was obtained by changing the structure close to the classifier network, rather than growing layers close to the input data, e.g., by replacing VGGNet with ResNet. The proposed network is suitable for sharing the weights in the classifier networks, by which property, the training can be faster with better generalization power. For the Pascal VOC 2007 test set trained with VOC 2007 and VOC 2012 training sets, the proposed network with the input size of 300 x 300 achieved 78.5% mAP (mean average precision) at the speed of 35.0 FPS (frame per second), while the network with a 512 x 512 sized input achieved 80.8% mAP at 16.6 FPS using Nvidia Titan X GPU. The proposed network shows state-of-the-art mAP, which is better than those of the conventional SSD, YOLO, Faster-RCNN and RFCN. Also, it is faster than Faster-RCNN and RFCN.


Enhancement of SSD by Concatenating Feature Maps for Object Detection

The paper presents a method that improves the accuracy of the Single Shot Multibox Detector (SSD), a prominent object detection algorithm known for its balance between speed and accuracy. The core idea is to use feature maps more effectively by modifying the structure of SSD near the classifier network, rather than by deepening the layers close to the input (for example, by replacing VGGNet with ResNet). The goal is to raise the mean average precision (mAP) while keeping inference speed close to that of the conventional SSD.

Methodology and Results

The authors address several limitations of the conventional SSD, in particular that each layer of the feature pyramid is processed independently, which leads to the same object being detected at multiple scales and to weak detection of small objects. To tackle these, they propose an architecture in which the relationship between feature pyramid layers is explicitly considered and the number of channels in these layers is increased efficiently by concatenating feature maps. This structure allows the classifier networks at different scales to share weights, which results in faster training and better generalization.
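The concatenation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-neighbor resizing used here is an assumed stand-in for whatever down-sampling and up-sampling operators the network actually uses, and the helper names (`resize_nearest`, `concat_pyramid`, `make_map`) and the toy pyramid sizes are hypothetical. The point it shows is that once every pyramid level is resized to each scale and stacked along the channel axis, all scales end up with the same number of channels.

```python
def resize_nearest(fmap, size):
    """Resize a feature map to size x size by nearest-neighbor sampling.

    fmap is a list of channels; each channel is an H x W nested list.
    Nearest-neighbor is an assumption made for this sketch only.
    """
    resized = []
    for channel in fmap:
        h, w = len(channel), len(channel[0])
        resized.append([[channel[r * h // size][c * w // size]
                         for c in range(size)]
                        for r in range(size)])
    return resized


def concat_pyramid(pyramid):
    """For each scale, resize every level to that scale and stack the channels."""
    merged_levels = []
    for target in pyramid:
        size = len(target[0])  # spatial size of the target scale
        merged = []
        for level in pyramid:
            merged.extend(resize_nearest(level, size))
        merged_levels.append(merged)
    return merged_levels


def make_map(channels, size, value):
    """Hypothetical helper: a constant feature map with the given shape."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# Toy pyramid (channels, spatial size): (4, 8), (8, 4), (2, 2).
toy_pyramid = [make_map(4, 8, 1.0), make_map(8, 4, 2.0), make_map(2, 2, 3.0)]
merged = concat_pyramid(toy_pyramid)
shapes = [(len(m), len(m[0]), len(m[0][0])) for m in merged]
print(shapes)  # [(14, 8, 8), (14, 4, 4), (14, 2, 2)]
```

Every merged level carries 4 + 8 + 2 = 14 channels while keeping its own spatial size, which is the property that makes a single classifier applicable at every scale.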

Trained on the VOC 2007 and VOC 2012 training sets and evaluated on the Pascal VOC 2007 test set, the proposed network with a $300 \times 300$ input achieved 78.5% mAP at 35.0 FPS, and with a $512 \times 512$ input it achieved 80.8% mAP at 16.6 FPS, both measured on an Nvidia Titan X GPU. These mAP figures exceed those of the conventional SSD, YOLO, Faster R-CNN, and R-FCN. The network is also faster than Faster R-CNN and R-FCN, though slightly slower than the conventional SSD.
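To put the reported throughput in per-frame terms, the short sketch below converts FPS to latency. It assumes frames are processed one at a time, so latency is simply the reciprocal of throughput; the source does not state the batch size used for timing, and the function name is hypothetical.

```python
# Reported throughput for the two input sizes (frames per second).
reported_fps = {"300x300": 35.0, "512x512": 16.6}


def fps_to_latency_ms(fps):
    """Per-frame latency in milliseconds, assuming one frame at a time."""
    return 1000.0 / fps


latency_ms = {name: round(fps_to_latency_ms(fps), 1)
              for name, fps in reported_fps.items()}
print(latency_ms)  # {'300x300': 28.6, '512x512': 60.2}
```

Under that assumption, the larger input buys 2.3 points of mAP at roughly twice the per-frame cost.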

Implications and Future Directions

The research suggests that by enhancing the feature pyramid structure and concatenating feature maps across scales, object detection systems can achieve better accuracy without a significant loss of speed. Weight sharing across scales not only improves training efficiency but also helps on datasets with an uneven distribution of object sizes, since objects of every size contribute to training the same classifier. This is particularly advantageous for small datasets.
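Weight sharing across scales can be sketched as one set of classifier weights applied to feature maps of different spatial sizes. The example below is a deliberate simplification and an assumption, not the paper's classifier: it uses a 1x1 weighted sum with a single output, whereas a real detection head predicts class scores and box offsets for several default boxes. The names `apply_shared_classifier` and `constant_map` are hypothetical.

```python
def apply_shared_classifier(fmap, weights, bias=0.0):
    """Apply a 1x1 classifier: a weighted sum over channels at each location.

    fmap is a list of channels (each an H x W nested list); weights has one
    entry per channel. The same weights can serve any spatial size as long
    as the channel count matches.
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[bias + sum(w * ch[r][c] for w, ch in zip(weights, fmap))
             for c in range(width)]
            for r in range(height)]


def constant_map(channels, size, value):
    """Hypothetical helper: a constant feature map with the given shape."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# One weight vector shared by three scales with equal channel counts.
shared_weights = [0.5, -1.0, 2.0]
scales = [constant_map(3, 4, 1.0), constant_map(3, 2, 1.0), constant_map(3, 1, 1.0)]

score_maps = [apply_shared_classifier(f, shared_weights) for f in scales]
score_shapes = [(len(s), len(s[0])) for s in score_maps]
print(score_shapes)         # [(4, 4), (2, 2), (1, 1)]
print(score_maps[2][0][0])  # 1.5
```

Because the weights are shared, a training example at any one scale updates the classifier used at every scale, which is the intuition behind the benefit for datasets where some object sizes are rare.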

The approach holds potential for further applications, especially in domains that require real-time processing and high precision in detecting small and varied-scale objects. Future research could combine this method with other architectural advances in Convolutional Neural Networks (CNNs) and further refine the balance between speed and accuracy, enabling efficient object detection in more computationally constrained environments. Additionally, exploring the impact of such enhancements on backbone networks other than VGGNet, such as ResNet, could provide insights into the adaptability and robustness of the proposed method.


Authors (3)

Jisoo Jeong (19 papers)
Hyojin Park (17 papers)
Nojun Kwak (116 papers)

Citations (302)
