
FSSD: Feature Fusion Single Shot Multibox Detector (1712.00960v4)

Published 4 Dec 2017 in cs.CV

Abstract: SSD (Single Shot Multibox Detector) is one of the best object detection algorithms with both high accuracy and fast speed. However, SSD's feature pyramid detection method makes it hard to fuse the features from different scales. In this paper, we proposed FSSD (Feature Fusion Single Shot Multibox Detector), an enhanced SSD with a novel and lightweight feature fusion module which can improve the performance significantly over SSD with just a little speed drop. In the feature fusion module, features from different layers with different scales are concatenated together, followed by some down-sampling blocks to generate new feature pyramid, which will be fed to multibox detectors to predict the final detection results. On the Pascal VOC 2007 test, our network can achieve 82.7 mAP (mean average precision) at the speed of 65.8 FPS (frame per second) with the input size 300$\times$300 using a single Nvidia 1080Ti GPU. In addition, our result on COCO is also better than the conventional SSD with a large margin. Our FSSD outperforms a lot of state-of-the-art object detection algorithms in both aspects of accuracy and speed. Code is available at https://github.com/lzx1413/CAFFE_SSD/tree/fssd.

Feature Fusion Single Shot Multibox Detector

The paper "Feature Fusion Single Shot Multibox Detector" (FSSD) introduces an enhanced object detection framework based on the widely recognized Single Shot Multibox Detector (SSD). With a focus on addressing challenges related to scale variations in object detection, FSSD integrates a novel feature fusion module that significantly improves upon the original SSD's performance metrics, showing advancements in both accuracy and speed.

Methodology Overview

The core contribution of the paper is a feature fusion module that combines multi-scale feature maps drawn from different convolutional layers of the network. Each selected feature map is projected and resized to a common resolution, and the results are concatenated into a single fused feature map; a series of down-sampling blocks then generates a new feature pyramid that feeds the multibox detectors. This contrasts with the traditional SSD, which predicts from each layer's features independently, leading to inefficiencies and inaccuracies, especially in small object detection.
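The fusion step described above can be sketched as follows. This is a minimal PyTorch illustration, not the authors' Caffe implementation: the input channel counts mirror SSD300's conv4_3, fc7, and conv7_2 feature maps, and the 256-channel projection width and BatchNorm placement are assumptions for illustration.

```python
# Hypothetical sketch of an FSSD-style fusion module (not the authors' Caffe code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Fuse multi-scale features: 1x1 conv -> resize to a common scale -> concat -> BN."""
    def __init__(self, in_channels, fused_channels=256, target_size=38):
        super().__init__()
        self.target_size = target_size
        # Project each source feature map to a common channel width.
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, fused_channels, kernel_size=1) for c in in_channels
        )
        self.bn = nn.BatchNorm2d(fused_channels * len(in_channels))

    def forward(self, feats):
        # Bilinearly resize every projected map to the largest spatial scale.
        resized = [
            F.interpolate(conv(f), size=(self.target_size, self.target_size),
                          mode="bilinear", align_corners=False)
            for conv, f in zip(self.reduce, feats)
        ]
        # Concatenate along channels; downsampling blocks (not shown) would
        # then build the new feature pyramid from this fused map.
        return self.bn(torch.cat(resized, dim=1))

# Toy usage with spatial sizes matching SSD300's conv4_3, fc7, and conv7_2:
feats = [torch.randn(1, 512, 38, 38),
         torch.randn(1, 1024, 19, 19),
         torch.randn(1, 512, 10, 10)]
fused = FusionModule([512, 1024, 512])(feats)
print(fused.shape)  # torch.Size([1, 768, 38, 38])
```

Because all predictions are made from pyramids derived from this one fused map, every detection scale sees both low-level localization detail and high-level semantics.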

Experimental Results

The paper presents strong numerical results demonstrating FSSD's superiority over SSD and other state-of-the-art detectors. On the Pascal VOC 2007 test set, FSSD achieves a mean average precision (mAP) of 82.7 at 65.8 frames per second (FPS) with 300×300 input on a single Nvidia 1080Ti GPU. This marks a notable improvement over the conventional SSD, particularly for small object detection, where the fused semantic information matters most. FSSD also outperforms DSSD while retaining efficiency comparable to YOLOv2, striking a balance between speed and accuracy without the computational overhead of deeper backbones such as ResNet-101.

Implications and Future Work

The implications of FSSD are substantial for the field of object detection. By fusing multi-scale features with minimal computational overhead, FSSD paves the way for more efficient and accurate real-time detection systems. Its fusion module could potentially be adapted to more complex models or integrated into frameworks like Mask R-CNN, suggesting a direction for future research. Additionally, replacing the VGG16 backbone with DenseNet or efficient lightweight models could further broaden FSSD's applicability, particularly where computational resources are limited.

This paper provides a well-structured approach that can facilitate advancements in deploying robust, real-time object detection systems across various applications, reinforcing the utility of feature fusion in convolutional neural networks.

References (33)
  1. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
  2. R-CNN for Small Object Detection, pages 214–230. Springer International Publishing, Cham, 2017.
  3. R-fcn: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 379–387. Curran Associates, Inc., 2016.
  4. Deformable convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  5. BlitzNet: A real-time deep network for scene understanding. In IEEE International Conference on Computer Vision (ICCV), 2017.
  6. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  7. DSSD: Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017.
  8. R. Girshick. Fast r-cnn. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  9. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
  10. Mask r-cnn. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  11. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
  12. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  13. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  14. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org, 2015.
  15. Enhancement of SSD by concatenating feature maps for object detection. CoRR, abs/1705.09587, 2017.
  16. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  17. Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 845–853, 2016.
  18. Feature pyramid networks for object detection. CoRR, abs/1612.03144, 2016.
  19. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  20. SSD: Single shot multibox detector, volume 9905 LNCS of Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pages 21–37. Springer Verlag, Germany, 2016.
  21. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
  22. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  23. Stacked Hourglass Networks for Human Pose Estimation, pages 483–499. Springer International Publishing, Cham, 2016.
  24. Learning to refine object segments. In ECCV, 2016.
  25. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  26. J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  27. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
  28. U-Net: Convolutional Networks for Biomedical Image Segmentation, pages 234–241. Springer International Publishing, Cham, 2015.
  29. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
  30. Dsod: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  31. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  32. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  33. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In ECCV. European Conference on Computer Vision, September 2014.
Authors (3)
  1. Zuoxin Li
  2. Lu Yang
  3. Fuqiang Zhou
Citations (483)