Enhancement of SSD by Concatenating Feature Maps for Object Detection
The paper presents a method for improving the Single Shot MultiBox Detector (SSD), a widely used object detection algorithm known for its balance between speed and accuracy. The core idea is to use the feature maps more effectively by modifying the structure of SSD near the classifier network, rather than by changing the layers close to the input, for example by replacing VGGNet with ResNet. The goal is to raise the mean average precision (mAP) while keeping computational speed close to that of the conventional SSD.
Methodology and Results
The authors address several limitations of the conventional SSD. In particular, each layer of the feature pyramid is processed independently, which leads to the same object being detected at multiple scales and to poor handling of small objects. To tackle these problems, they propose an architecture in which the relationship between feature pyramid layers is considered explicitly and the number of channels in these layers is increased efficiently. Because the feature maps are concatenated across layers, the classifier networks at different scales can share weights, which results in faster training and better generalization.
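The concatenation idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`max_pool`, `upsample`, `concat_pyramid`) are hypothetical, the scales are assumed to differ by exact integer factors, and nearest-neighbour upsampling stands in for whatever learned upsampling a real network would use. The sketch only shows that, after concatenation, every scale carries the same number of channels.

```python
# Toy sketch of concatenating feature maps across pyramid scales.
# A feature map is a nested list indexed as fmap[channel][row][col].
# Assumptions: square maps, scales differ by exact integer factors,
# max pooling shrinks maps, nearest-neighbour upsampling enlarges them.

def max_pool(fmap, factor):
    """Shrink each channel by taking the max over factor x factor blocks."""
    size = len(fmap[0]) // factor
    return [
        [
            [
                max(
                    ch[r * factor + dr][c * factor + dc]
                    for dr in range(factor)
                    for dc in range(factor)
                )
                for c in range(size)
            ]
            for r in range(size)
        ]
        for ch in fmap
    ]


def upsample(fmap, factor):
    """Enlarge each channel by repeating every value factor x factor times."""
    size = len(fmap[0]) * factor
    return [
        [[ch[r // factor][c // factor] for c in range(size)] for r in range(size)]
        for ch in fmap
    ]


def resize_to(fmap, target_size):
    """Bring a feature map to target_size by pooling or upsampling."""
    size = len(fmap[0])
    if size > target_size:
        return max_pool(fmap, size // target_size)
    if size < target_size:
        return upsample(fmap, target_size // size)
    return fmap


def concat_pyramid(pyramid):
    """For every scale, concatenate all maps (resized to that scale) along channels."""
    merged_pyramid = []
    for target in pyramid:
        target_size = len(target[0])
        merged = []
        for fmap in pyramid:
            merged.extend(resize_to(fmap, target_size))
        merged_pyramid.append(merged)
    return merged_pyramid


def make_map(channels, size, value):
    """Build a constant feature map, used here only as dummy data."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# Three dummy scales with 2, 3, and 1 channels at sizes 4, 2, and 1.
pyramid = [make_map(2, 4, 1.0), make_map(3, 2, 2.0), make_map(1, 1, 3.0)]
merged = concat_pyramid(pyramid)
shapes = [(len(m), len(m[0]), len(m[0][0])) for m in merged]
print(shapes)  # [(6, 4, 4), (6, 2, 2), (6, 1, 1)]
```

Every merged map ends up with 2 + 3 + 1 = 6 channels regardless of its spatial size, which is the property that makes a single classifier applicable to all scales.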
When evaluated on the Pascal VOC 2007 test set, the proposed network with a 300 × 300 input achieved 78.5% mAP at 35.0 FPS, outperforming the conventional SSD, YOLO, Faster R-CNN, and R-FCN in terms of mAP. With a 512 × 512 input, the network reached 80.8% mAP at 16.6 FPS on an Nvidia Titan X GPU, running faster than Faster R-CNN and R-FCN but slightly slower than the conventional SSD.
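To make the speed figures easier to compare, the reported frame rates can be converted to per-frame latency. The snippet below is plain arithmetic on the two results quoted above; the function name `latency_ms` and the configuration labels are illustrative choices, not terms from the paper.

```python
def latency_ms(fps):
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


# (mAP in percent, FPS) for the two reported configurations.
reported = {
    "smaller input": (78.5, 35.0),
    "larger input": (80.8, 16.6),
}

for name, (map_percent, fps) in reported.items():
    print(f"{name}: {map_percent:.1f}% mAP, {latency_ms(fps):.1f} ms per frame")
```

Under this conversion, the larger-input configuration gains 2.3 mAP points at roughly twice the per-frame cost.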
Implications and Future Directions
The research suggests that, by enhancing the feature pyramid structure and concatenating feature maps, object detection systems can achieve better accuracy without a significant loss of speed. Sharing weights across scales not only improves training efficiency but also helps with datasets in which object sizes are unevenly distributed, which is particularly advantageous for small datasets.
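One way to see the training-efficiency argument is to count classifier parameters with and without sharing. The sketch below assumes each classifier is a single 3 × 3 convolution that predicts class scores plus four box offsets per default box; the channel count, number of boxes, number of classes, and number of scales are hypothetical values chosen for illustration, not figures taken from the paper.

```python
def conv_head_params(in_channels, boxes_per_cell, num_classes, kernel=3):
    """Parameters of one convolutional head: weights plus biases."""
    outputs = boxes_per_cell * (num_classes + 4)  # class scores + 4 box offsets
    weights = kernel * kernel * in_channels * outputs
    return weights + outputs


# Hypothetical configuration (illustrative values only).
NUM_SCALES = 6
IN_CHANNELS = 1024
BOXES_PER_CELL = 6
NUM_CLASSES = 21

one_head = conv_head_params(IN_CHANNELS, BOXES_PER_CELL, NUM_CLASSES)
separate_total = NUM_SCALES * one_head  # one head per scale
shared_total = one_head                 # a single head reused at every scale

print(f"separate heads: {separate_total:,} parameters")
print(f"shared head:    {shared_total:,} parameters")
```

With a shared head, every training example at any scale updates the same parameters, so a scale with few objects still benefits from what is learned at the others.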
The approach holds potential for further applications, especially in domains that require real-time processing and high precision in detecting small objects and objects of varied scale. Future research could combine this method with other architectural advances in Convolutional Neural Networks (CNNs) and further refine the balance between speed and accuracy across use cases, enabling efficient object detection in computationally constrained environments. Additionally, exploring the impact of such enhancements on backbone networks other than VGGNet could provide insights into the adaptability and robustness of the proposed method.