Is Faster R-CNN Doing Well for Pedestrian Detection?
The paper "Is Faster R-CNN Doing Well for Pedestrian Detection?" by Liliang Zhang, Liang Lin, Xiaodan Liang, and Kaiming He investigates how well Faster R-CNN works for pedestrian detection, an object detection task with specialized requirements. Although Faster R-CNN is successful in general object detection, the paper shows that it falls short on pedestrians and proposes an improved pipeline that combines a Region Proposal Network (RPN) with Boosted Forests (BF).
Main Contributions and Findings
The primary contributions of this paper are:
- Performance of RPN as a Stand-Alone Detector: The research demonstrates that the RPN, normally just one component of Faster R-CNN, performs competitively as a stand-alone pedestrian detector. With anchors that match the average pedestrian aspect ratio and are scaled to cover a wider range of sizes, the RPN achieves higher recall than traditional hand-crafted-feature methods such as SCF and LDCF.
- Downstream Classifier Degradation: Contrary to expectations, the paper finds that feeding RPN proposals into the Fast R-CNN classifier degrades performance. The authors attribute this to the insufficient resolution of convolutional feature maps used in Fast R-CNN, which adversely affects the detection of small-sized pedestrian instances.
- Enhancements with Boosted Forests: To address the resolution issue and improve handling of hard negative examples, the authors train a BF classifier on high-resolution convolutional features shared with the RPN. This approach removes the need for traditional hand-crafted features while improving accuracy, and sharing features with the RPN keeps the added computation low.
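The anchor design mentioned in the first contribution can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pedestrian_anchors` is hypothetical, and the default values (a single width-to-height ratio of 0.41, 9 scales starting at a height of 40 pixels with a 1.3x stride) are recalled from the paper and should be treated as assumptions to verify against it.

```python
def pedestrian_anchors(base_height=40.0, scale_stride=1.3,
                       num_scales=9, aspect_ratio=0.41):
    """Return (width, height) pairs for a single-aspect-ratio anchor set.

    All default values are assumptions recalled from the paper, not
    values stated in this summary.
    """
    anchors = []
    for i in range(num_scales):
        # Heights grow geometrically so the anchors span small to large pedestrians.
        height = base_height * scale_stride ** i
        # aspect_ratio is width / height, so pedestrians get tall, narrow boxes.
        width = height * aspect_ratio
        anchors.append((width, height))
    return anchors


anchors = pedestrian_anchors()
for width, height in anchors:
    print(f"{width:6.1f} x {height:6.1f}")
```

Using one aspect ratio and many scales, rather than several aspect ratios at a few scales, reflects the fact that pedestrians vary far more in size than in shape.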
Experimental Results
The paper evaluates the method on four benchmarks: Caltech, INRIA, ETH, and KITTI.
- Caltech: The method yields an MR (log-average Miss Rate) of 9.6%, outperforming the other state-of-the-art methods it is compared against.
- INRIA and ETH: On these datasets, leveraging high-resolution features and effective bootstrapping, the proposed method achieves MRs of 6.9% and 30.2%, respectively, surpassing previous best results.
- KITTI: With a mean Average Precision (mAP) of 61.15% on the "Moderate" difficulty level, the method is competitive and maintains practical inference speed at 0.5 seconds per image.
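For readers unfamiliar with the MR metric used above, the following is a minimal sketch of a log-average miss rate computation. It assumes the common Caltech-style convention of averaging, in log space, the miss rate at nine false-positives-per-image (FPPI) reference points evenly spaced between 10^-2 and 10^0. The function name and the input format are hypothetical, and the official evaluation toolbox may differ in details.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    `fppi` must be sorted in ascending order, with `miss_rate[i]` the miss
    rate reached at `fppi[i]`. Both inputs are assumed to come from sweeping
    a detection score threshold.
    """
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    log_sum = 0.0
    for ref in refs:
        # Use the miss rate at the largest FPPI not exceeding the reference.
        # If the curve never gets that low in FPPI, count it as missing everything.
        mr = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        log_sum += math.log(max(mr, 1e-10))
    return math.exp(log_sum / num_points)


# A toy curve, not data from the paper.
flat_curve = log_average_miss_rate([0.001, 0.1, 1.0], [0.2, 0.2, 0.2])
late_curve = log_average_miss_rate([0.5], [0.1])
print(flat_curve, late_curve)
```

Averaging in log space means a detector is rewarded for low miss rates across the whole FPPI range, not only at the permissive end.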
Technical Details and Implementation
Key technical contributions include:
- High-Resolution Feature Extraction: By pooling features from shallower layers such as Conv3_3 and Conv4_3, the BF classifier operates on high-resolution features, which are crucial for detecting small objects.
- Bootstrapping for Hard Negative Mining: The cascaded Boosted Forest is trained with bootstrapping that iteratively mines hard negatives, which reduces false positives.
- Combination of Features: Because the BF classifier does not require feature normalization, features pooled from multiple convolutional layers at different resolutions can simply be concatenated for classification.
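The bootstrapping idea in the list above can be illustrated with a toy loop. This is a sketch under strong assumptions, not the paper's training procedure: a one-dimensional threshold classifier (`train_threshold`, hypothetical) stands in for the Boosted Forest, the features are synthetic random numbers, and the stage count and number of mined negatives are arbitrary.

```python
import random


def train_threshold(pos, neg):
    """Toy stand-in for a Boosted Forest: threshold midway between class means."""
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def bootstrap(pos, neg_pool, num_stages=3, num_hard=20, seed=0):
    """Retrain repeatedly, each time adding the highest-scoring negatives."""
    rng = random.Random(seed)
    pool = list(neg_pool)
    rng.shuffle(pool)
    # Start from a random negative subset the same size as the positive set.
    train_neg, pool = pool[:len(pos)], pool[len(pos):]
    thresholds = [train_threshold(pos, train_neg)]
    for _ in range(num_stages):
        threshold = thresholds[-1]
        # Rank remaining negatives by score; the top ones are the hardest.
        pool.sort(reverse=True)
        hard = [x for x in pool[:num_hard] if x > threshold]  # false positives only
        if not hard:
            break
        train_neg += hard
        pool = pool[len(hard):]
        thresholds.append(train_threshold(pos, train_neg))
    return thresholds


# Synthetic 1-D "features": positives centered at 2, negatives at 0.
data_rng = random.Random(42)
positives = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
negatives = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
thresholds = bootstrap(positives, negatives)
print(thresholds)
```

Each stage moves the decision boundary toward the negatives that the previous stage misclassified, which is the mechanism that makes the final classifier more robust to false positives.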
Implications and Future Directions
The paper illustrates the significance of high-resolution features and hard-negative mining in pedestrian detection:
- Practical Implications: For autonomous driving and surveillance applications, the proposed improvements are particularly relevant due to the frequent occurrence of small-sized pedestrians in these scenarios.
- Theoretical Implications: The paper highlights the limitations of the Fast R-CNN classifier on tasks that require detecting and accurately localizing small objects, suggesting a broader need for tailored approaches in specific object detection domains.
Speculative Future Developments:
- End-to-End Hard Example Mining: The paper notes the potential overlap with methods like Online Hard Example Mining (OHEM). A promising future direction would be a thorough comparative analysis or an integration of online hard mining techniques for further optimization.
- Enhancements in Deep Learning Techniques: More advanced deep learning architectures can be explored to improve feature resolution adaptively, addressing the intrinsic limitations of current pooling layers in handling small instances.
In conclusion, the paper identifies why Faster R-CNN underperforms on pedestrian detection and addresses those issues directly. The combination of RPN and Boosted Forests is a robust alternative that demonstrates the importance of high-resolution features and effective hard-negative mining. The results have practical value for real-world applications and provide a foundation for future work on specialized object detection tasks.