- The paper introduces a novel multi-stage fusion method combining SSD for candidate generation with multiple DNN classifiers to improve detection precision.
- It employs a soft-rejection based fusion strategy that adjusts confidence scores, effectively reducing false positives in pedestrian detection.
- Evaluated on the Caltech Pedestrian dataset, the approach achieves a log-average miss rate of 8.65%, surpassing prior state-of-the-art methods.
Overview of Fused DNN for Pedestrian Detection
The paper "Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection" introduces an architecture for pedestrian detection that aims to be both accurate and fast enough for real-time use. The authors propose the Fused Deep Neural Network (F-DNN), which uses a multi-stage fusion process to improve detection in difficult cases such as occlusion and widely varying pedestrian scales.
The detection framework consists of three main components: a pedestrian candidate generator based on a Single Shot MultiBox Detector (SSD), a classification network made up of multiple deep neural network (DNN) classifiers running in parallel, and a pixel-wise semantic segmentation (SS) network that reinforces the detections. The components are combined through soft-rejection based network fusion (SNF). Rather than letting a classifier accept or reject a candidate outright, SNF scales the candidate's confidence score according to the aggregated probabilities from the classifiers. Because no single classifier can discard a candidate on its own, the pipeline is more tolerant of individual classifier errors than a hard accept/reject cascade.
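The soft-rejection idea described above can be illustrated with a minimal sketch. The function name, the scaling rule `max(p / threshold, floor)`, and the values `threshold=0.7` and `floor=0.1` are assumptions chosen for illustration; this summary does not state the exact formula or constants used by the authors.

```python
def soft_rejection_fusion(detector_score, classifier_probs,
                          threshold=0.7, floor=0.1):
    """Scale a candidate's detector score by one factor per classifier.

    threshold and floor are hypothetical values: a classifier probability
    above `threshold` boosts the score, one below it shrinks the score,
    and `floor` bounds how much a single classifier can shrink it.
    """
    fused = detector_score
    for p in classifier_probs:
        # Soft rejection: never multiply by zero, so one low probability
        # cannot remove a candidate on its own.
        factor = max(p / threshold, floor)
        fused *= factor
    return fused


if __name__ == "__main__":
    # Two confident classifiers boost the candidate.
    print(soft_rejection_fusion(0.5, [0.9, 0.95]))
    # One classifier disagrees strongly; the score drops but stays nonzero.
    print(soft_rejection_fusion(0.5, [0.9, 0.01]))
```

Under this rule the fused value is a ranking score rather than a probability, so it may exceed 1 when several classifiers are confident.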
Methodology
- Pedestrian Candidate Generation: The authors employ an SSD model, leveraging VGG16 as the base network, to generate a pool of pedestrian candidates. The SSD generates bounding boxes at varying scales and aspect ratios to ensure comprehensive coverage of true pedestrians. This stage aims to maximize recall albeit at the expense of generating false positives.
- Classification Network: To refine the candidate boxes, the system employs a classification network consisting of multiple DNN classifiers such as ResNet-50 and GoogLeNet. Each classifier scores every candidate independently, and these scores are used to suppress the false positives introduced in the previous stage.
- Soft-Rejection Based DNN Fusion: The SNF method combines the outputs of the classifiers into weighted confidence adjustments instead of binary decisions, which reduces the risk of rejecting a true positive because of one classifier's error. The classification probabilities scale the candidate's confidence score up or down, balancing the boosting of true positives against the suppression of false detections.
- Semantic Segmentation Network: For additional robustness, a semantic segmentation network trained on the Cityscapes dataset provides pixel-wise support to the detection pipeline. Its pedestrian mask is compared against each detected bounding box, and the candidate's confidence is adjusted according to how well the box agrees with the pixels labeled as pedestrian.
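The segmentation-based refinement in the last step can be sketched as an overlap check between a binary pedestrian mask and a candidate box. Everything specific here is hypothetical: the helper names, the box convention `(x0, y0, x1, y1)` with exclusive upper bounds, and the values `min_coverage=0.2` and `floor=0.35` are assumptions for illustration, not details given in this summary.

```python
def mask_coverage(mask, box):
    """Fraction of the box's pixels that the mask labels as pedestrian.

    mask: 2D list of 0/1 values indexed as mask[y][x].
    box:  (x0, y0, x1, y1), upper bounds exclusive (assumed convention).
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0
    covered = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return covered / area


def segmentation_refine(score, coverage, min_coverage=0.2, floor=0.35):
    """Keep the score if the mask supports the box, otherwise scale it down.

    min_coverage and floor are hypothetical values; the score is never
    set to zero, in keeping with the soft-rejection idea.
    """
    if coverage >= min_coverage:
        return score
    return score * max(coverage / min_coverage, floor)


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    cov = mask_coverage(mask, (0, 0, 4, 4))
    print(cov, segmentation_refine(0.8, cov))
```

The scaling is continuous at `min_coverage`, so a box just below the cutoff is penalized only slightly rather than dropping abruptly.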
Results
The proposed architecture was evaluated on the Caltech Pedestrian dataset. F-DNN achieved a log-average miss rate (L-AMR) of 8.65% on the 'Reasonable' setting, and integrating the SS network lowered this further to 8.18%; both figures surpass the previous state-of-the-art methods. The model also remained robust in challenging scenarios involving small, occluded, and crowded pedestrians, while running faster than competing approaches of comparable accuracy.
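For readers unfamiliar with the metric, the log-average miss rate used in the Caltech benchmark is the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values spaced evenly in log space between 10^-2 and 10^0. The sketch below follows that definition. The function name and the choice to use a miss rate of 1.0 when the curve has no point at or below a reference FPPI are assumptions, and the input curve in the usage example is made-up data, not results from the paper.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rate at log-spaced FPPI reference points.

    fppi:      FPPI values of the curve, sorted in ascending order.
    miss_rate: miss rates at the same operating points.
    """
    # Reference points: 10^-2, 10^-1.75, ..., 10^0 for num_points=9.
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    log_sum = 0.0
    for ref in refs:
        sampled = 1.0  # assumed default if the curve starts above ref
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                sampled = m  # last operating point at or below ref
            else:
                break
        # Guard against log(0) for a perfect detector.
        log_sum += math.log(max(sampled, 1e-10))
    return math.exp(log_sum / num_points)


if __name__ == "__main__":
    # Illustrative curve only.
    fppi = [0.005, 0.05, 0.5]
    miss = [0.40, 0.20, 0.10]
    print(log_average_miss_rate(fppi, miss))
```

Lower values are better: a detector that misses fewer pedestrians at the same false-positive budget has a lower log-average miss rate.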
Implications and Future Work
The paper shows that fusing several networks through soft rejection can improve both the precision and the computational efficiency of pedestrian detection. This is relevant to applications such as autonomous driving, surveillance, and crowd analysis.
Looking forward, integrating additional classifiers could further enhance the model's robustness. Techniques such as label smoothing, and the use of object detection for instance-aware semantic segmentation, are also potential avenues for subsequent research. The interplay between semantic segmentation and detection remains a promising direction for building more complete scene-understanding frameworks in computer vision.