FSSD: Feature Fusion Single Shot Multibox Detector (1712.00960v4)
Abstract: SSD (Single Shot Multibox Detector) is one of the best object detection algorithms, offering both high accuracy and fast speed. However, SSD's feature pyramid detection method makes it hard to fuse features from different scales. In this paper, we propose FSSD (Feature Fusion Single Shot Multibox Detector), an enhanced SSD with a novel, lightweight feature fusion module that improves performance significantly over SSD with only a small drop in speed. In the feature fusion module, features from different layers with different scales are concatenated together, followed by several down-sampling blocks that generate a new feature pyramid, which is fed to multibox detectors to predict the final detection results. On the Pascal VOC 2007 test set, our network achieves 82.7 mAP (mean average precision) at 65.8 FPS (frames per second) with an input size of 300$\times$300 on a single Nvidia 1080Ti GPU. In addition, our result on COCO is better than that of the conventional SSD by a large margin. FSSD outperforms many state-of-the-art object detection algorithms in both accuracy and speed. Code is available at https://github.com/lzx1413/CAFFE_SSD/tree/fssd.
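The fusion step described in the abstract (bring multi-scale features to a common resolution, concatenate them, then down-sample repeatedly to form a new pyramid) can be sketched at the shape level as follows. This is a minimal NumPy illustration, not the authors' Caffe implementation: the function names, the nearest-neighbor resizing, the 2x2 average pooling used in place of learned down-sampling blocks, and the example channel counts and map sizes (38, 19, 10) are all assumptions made for illustration.

```python
import numpy as np


def resize_nearest(feat, size):
    """Resize a (C, H, W) feature map to (C, size, size) by nearest-neighbor sampling."""
    _, h, w = feat.shape
    rows = np.arange(size) * h // size  # source row for each target row
    cols = np.arange(size) * w // size  # source column for each target column
    return feat[:, rows[:, None], cols[None, :]]


def fuse(features):
    """Bring all maps to the largest spatial size and concatenate along channels."""
    target = max(f.shape[1] for f in features)
    resized = [resize_nearest(f, target) for f in features]
    return np.concatenate(resized, axis=0)


def downsample(feat):
    """Halve the spatial size with 2x2 average pooling (odd sizes are edge-padded).

    Assumption: this stands in for a learned stride-2 down-sampling block.
    """
    c, h, w = feat.shape
    ph, pw = h % 2, w % 2
    padded = np.pad(feat, ((0, 0), (0, ph), (0, pw)), mode="edge")
    return padded.reshape(c, (h + ph) // 2, 2, (w + pw) // 2, 2).mean(axis=(2, 4))


def build_pyramid(fused, levels):
    """Generate a new feature pyramid from the fused map by repeated down-sampling."""
    pyramid = [fused]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


# Hypothetical inputs: three square feature maps at different scales.
rng = np.random.default_rng(0)
features = [
    rng.standard_normal((4, 38, 38)),
    rng.standard_normal((6, 19, 19)),
    rng.standard_normal((8, 10, 10)),
]
fused = fuse(features)              # shape (18, 38, 38)
pyramid = build_pyramid(fused, 4)   # spatial sizes 38, 19, 10, 5
```

In a real detector each pyramid level would then be passed to the multibox prediction heads. The sketch only shows how one fused map can give every level access to features from all source scales.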