Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection (1610.03466v2)

Published 11 Oct 2016 in cs.CV

Abstract: We propose a deep neural network fusion architecture for fast and robust pedestrian detection. The proposed network fusion architecture allows for parallel processing of multiple networks for speed. A single shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates of different sizes and occlusions. This network outputs a large variety of pedestrian candidates to cover the majority of ground-truth pedestrians while also introducing a large number of false positives. Next, multiple deep neural networks are used in parallel for further refinement of these pedestrian candidates. We introduce a soft-rejection based network fusion method to fuse the soft metrics from all networks together to generate the final confidence scores. Our method performs better than existing state-of-the-art methods, especially when detecting small-size and occluded pedestrians. Furthermore, we propose a method for integrating a pixel-wise semantic segmentation network into the network fusion architecture as a reinforcement to the pedestrian detector. The approach outperforms state-of-the-art methods on most protocols on the Caltech Pedestrian dataset, with significant boosts on several protocols. It is also faster than all other methods.

Summary

The paper introduces a novel multi-stage fusion method combining SSD for candidate generation with multiple DNN classifiers to improve detection precision.
It employs a soft-rejection based fusion strategy that adjusts confidence scores, effectively reducing false positives in pedestrian detection.
Evaluated on the Caltech Pedestrian dataset, the approach achieves a log-average miss rate of 8.65%, surpassing prior state-of-the-art methods.

Overview of Fused DNN for Pedestrian Detection

The paper "Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection" introduces an architecture for pedestrian detection that aims to be both fast and accurate. The authors propose the Fused Deep Neural Network (F-DNN), which uses a multi-stage fusion process to improve detection in difficult cases such as occluded pedestrians and pedestrians at widely varying scales.

The detection framework consists of three main components: a pedestrian candidate generator based on a Single Shot MultiBox Detector (SSD), a classification network composed of multiple deep neural network (DNN) classifiers running in parallel, and a pixel-wise semantic segmentation (SS) network used as reinforcement. The method relies on soft-rejection based network fusion (SNF): instead of letting each classifier make a hard accept-or-reject decision, SNF scales the confidence score of each candidate according to the probabilities reported by the classifiers. Because no single classifier can discard a candidate outright, one erroneous low probability is less likely to eliminate a true pedestrian.
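The soft-rejection idea can be illustrated with a minimal Python sketch. It assumes each classifier contributes a multiplicative factor proportional to its probability, with a floor so that a single classifier cannot drive the score to zero. The function name `soft_rejection_fusion`, the parameters `a_c` (confidence threshold) and `b_c` (floor), and their default values are illustrative assumptions, not figures stated in this summary.

```python
from math import prod


def soft_rejection_fusion(ssd_score, classifier_probs, a_c=0.7, b_c=0.1):
    """Fuse one candidate's detector score with per-classifier probabilities.

    ssd_score        -- confidence from the candidate generator
    classifier_probs -- pedestrian probability from each classifier
    a_c              -- assumed threshold: probabilities above it boost the
                        score, probabilities below it shrink the score
    b_c              -- assumed floor on each factor, so one classifier
                        alone can never reduce the score to zero
    """
    factors = [max(p / a_c, b_c) for p in classifier_probs]
    return ssd_score * prod(factors)


# Two confident classifiers raise the score; a dissenting one lowers it
# without rejecting the candidate outright.
boosted = soft_rejection_fusion(0.5, [0.9, 0.8])
penalised = soft_rejection_fusion(0.5, [0.9, 0.0])
```

With a hard-rejection scheme the second candidate would be dropped as soon as one classifier voted against it; here it survives with a reduced score and can still be ranked against other detections.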

Methodology

Pedestrian Candidate Generation: The authors employ an SSD model, leveraging VGG16 as the base network, to generate a pool of pedestrian candidates. The SSD generates bounding boxes at varying scales and aspect ratios to ensure comprehensive coverage of true pedestrians. This stage aims to maximize recall albeit at the expense of generating false positives.
Classification Network: To refine the candidate boxes, the system employs a classification network consisting of multiple DNN classifiers such as ResNet-50 and GoogLeNet. Each classifier scores the detected candidates independently, which helps suppress the false positives introduced in the previous stage.
Soft-Rejection Based DNN Fusion: The SNF method integrates output from multiple classifiers, offering weighted confidence adjustments instead of binary decisions, thereby preventing false rejections of true positives. The fusion harnesses classification probabilities to modulate the confidence scores, achieving a balance between boosting true positives and suppressing false detections.
Semantic Segmentation Network: For additional robustness, a semantic segmentation network trained on the Cityscapes dataset provides pixel-wise support to the detection pipeline. Its segmentation mask is compared with each detected bounding box, and the candidate's confidence is adjusted according to how well the two agree spatially.
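One way to picture the segmentation reinforcement step is as an overlap test between a binary pedestrian mask and each candidate box. The sketch below is a hypothetical illustration under that assumption: the helper names, the half-open box convention, and the values of `min_coverage`, `gain`, and `floor` are placeholders chosen for the example, not parameters given in this summary.

```python
def mask_coverage(mask, box):
    """Fraction of the box area covered by pedestrian pixels.

    mask -- 2D list of 0/1 values indexed as mask[y][x]
    box  -- (x0, y0, x1, y1) in pixel coordinates, half-open on x1 and y1
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0
    covered = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return covered / area


def reinforce_with_segmentation(score, mask, box,
                                min_coverage=0.2, gain=4.0, floor=0.35):
    """Keep the score when the mask supports the box, otherwise soften it."""
    coverage = mask_coverage(mask, box)
    if coverage >= min_coverage:
        return score  # mask and box agree: leave the score unchanged
    # Weak support: scale the score down, but never below floor * score.
    return score * max(coverage * gain, floor)


# A 4x4 mask with two pedestrian pixels gives coverage 2/16 = 0.125.
example_mask = [[0, 0, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]]
adjusted = reinforce_with_segmentation(0.8, example_mask, (0, 0, 4, 4))
```

As with the classifier fusion, the adjustment is soft: a box with little mask support is down-weighted rather than removed.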

Results

The proposed architecture was evaluated on the Caltech Pedestrian dataset, demonstrating superior detection performance. The F-DNN achieved a log-average miss rate (L-AMR) of 8.65% on the 'Reasonable' setting, which was further improved to 8.18% with the integration of the SS network, significantly surpassing previous state-of-the-art methods. Additionally, the model showed robustness across challenging scenarios such as small, occluded, and crowded pedestrian scenes, offering both improved speed and accuracy.
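For readers unfamiliar with the metric, the log-average miss rate is commonly computed on the Caltech benchmark as the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values spaced evenly in log space between 0.01 and 1. The sketch below follows that convention; the function name and the choice to treat reference points left of the curve as a miss rate of 1 are assumptions of this example, and it does not reproduce the paper's evaluation code.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates at log-spaced FPPI reference points.

    fppi      -- FPPI values of the curve, sorted in ascending order
    miss_rate -- miss rate at each FPPI value (same length as fppi)
    """
    refs = [10.0 ** (-2.0 + 2.0 * i / (num_points - 1))
            for i in range(num_points)]
    log_sum = 0.0
    for ref in refs:
        sampled = 1.0  # assumed miss rate when the curve starts after ref
        for x, y in zip(fppi, miss_rate):
            if x <= ref:
                sampled = y  # last curve point at or before the reference
            else:
                break
        log_sum += math.log(max(sampled, 1e-10))  # guard against log(0)
    return math.exp(log_sum / num_points)


# A flat curve at a 10% miss rate yields a log-average of about 0.1.
flat = log_average_miss_rate([0.001, 0.1, 1.0], [0.1, 0.1, 0.1])
```

Averaging in log space means the metric reflects behaviour across two orders of magnitude of FPPI rather than being dominated by the high-FPPI end of the curve.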

Implications and Future Work

The paper shows that fusing several networks through soft rejection can improve both the precision and the computational efficiency of pedestrian detection. This is relevant to applications such as autonomous driving, surveillance, and crowd analysis.

Looking forward, integrating additional classifiers could further improve the model's robustness. Techniques such as label smoothing, and the use of object detection for instance-aware semantic segmentation, are further candidates for follow-up research. More broadly, the interplay between semantic segmentation and detection remains a promising direction for scene understanding in computer vision.
