A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection (1607.07155v1)

Published 25 Jul 2016 in cs.CV

Abstract: A unified deep neural network, denoted the multi-scale CNN (MS-CNN), is proposed for fast multi-scale object detection. The MS-CNN consists of a proposal sub-network and a detection sub-network. In the proposal sub-network, detection is performed at multiple output layers, so that receptive fields match objects of different scales. These complementary scale-specific detectors are combined to produce a strong multi-scale object detector. The unified network is learned end-to-end, by optimizing a multi-task loss. Feature upsampling by deconvolution is also explored, as an alternative to input upsampling, to reduce the memory and computation costs. State-of-the-art object detection performance, at up to 15 fps, is reported on datasets, such as KITTI and Caltech, containing a substantial number of small objects.



The paper "A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection" by Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, and Nuno Vasconcelos presents a unified multi-scale deep convolutional neural network (MS-CNN) for object detection. The MS-CNN targets the problem of detecting objects whose sizes vary widely within a scene, a long-standing challenge for both classical and deep-learning-based detectors.

Key Contributions

The MS-CNN is composed of two primary components: an object proposal sub-network and an object detection sub-network. Both sub-networks are integrated and trained end-to-end, optimizing a multi-task loss function. Several architectural innovations in the MS-CNN contribute to its effectiveness and efficiency.

Multi-scale Proposal Generation:
- Unlike traditional approaches that rely heavily on input upsampling for small object detection, the MS-CNN conducts detection at multiple output layers, each of which is attuned to specific object scales.
- This multi-scale detection scheme ensures that lower layers with smaller receptive fields detect small objects, while higher layers focus on larger objects.
- This strategy addresses a mismatch in earlier detectors such as Faster-RCNN, where proposals for all object sizes are generated from a single layer with a fixed receptive field, which cannot cover the wide range of object scales found in natural scenes.
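
To make the receptive-field argument concrete, the sketch below computes how the receptive field grows with depth in a stack of convolution and pooling layers, using the standard recurrence (receptive field grows by `(kernel - 1) * jump`, and `jump` is multiplied by the stride). The layer list is a VGG-16-style layout assumed purely for illustration; the summary above does not state which backbone or which branch layers the paper uses, and the names `VGG16_STYLE_LAYERS` and `receptive_fields` are hypothetical.

```python
# Each entry is (layer name, kernel size, stride).
# ASSUMPTION: a VGG-16-style stack, chosen only to illustrate the idea.
VGG16_STYLE_LAYERS = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1), ("pool4", 2, 2),
    ("conv5_1", 3, 1), ("conv5_2", 3, 1), ("conv5_3", 3, 1),
]


def receptive_fields(layers):
    """Return {layer name: receptive field size in input pixels}."""
    rf = 1    # receptive field of a single input pixel
    jump = 1  # distance in input pixels between adjacent units of the layer
    result = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra kernel tap adds one jump
        jump *= stride             # striding spreads units further apart
        result[name] = rf
    return result


fields = receptive_fields(VGG16_STYLE_LAYERS)
for name in ("conv3_3", "conv4_3", "conv5_3"):
    print(name, fields[name])
```

Under these assumptions the receptive field is 40 pixels at conv3_3, 92 at conv4_3, and 196 at conv5_3, which illustrates why a detector attached to a shallower layer is a more natural match for small objects than one attached to the deepest layer.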
Feature Upsampling:
- The paper explores feature upsampling with deconvolution layers as an alternative to input upsampling, reducing memory and computation costs while still strengthening the network's response to small objects.
- This proves particularly useful for detecting small objects without enlarging the input image and paying the associated computational cost.
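
As a rough illustration of what a deconvolution (transposed convolution) layer does to a feature map, the following one-dimensional sketch doubles the length of a feature row. The kernel size, stride, padding, and the fixed bilinear weights are assumptions made for the example; in a trained network the deconvolution weights would normally be learned, and the summary does not give the paper's actual settings. The function name `conv_transpose_1d` is hypothetical.

```python
def conv_transpose_1d(x, kernel, stride, padding=0):
    """1-D transposed convolution on plain Python lists.

    Output length is (len(x) - 1) * stride + len(kernel) - 2 * padding.
    """
    full_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * full_len
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            # Each input value scatters a scaled copy of the kernel.
            out[i * stride + j] += xi * kj
    # Crop `padding` samples from both ends.
    return out[padding:full_len - padding]


# ASSUMPTION: fixed bilinear weights for 2x upsampling (kernel 4, stride 2,
# padding 1); a real deconvolution layer would learn its weights.
BILINEAR_2X = [0.25, 0.75, 0.75, 0.25]

features = [1.0, 2.0, 3.0, 4.0]
upsampled = conv_transpose_1d(features, BILINEAR_2X, stride=2, padding=1)
print(len(features), "->", len(upsampled))
print(upsampled)
```

The design point is that only the feature map consumed by the detector is enlarged. Upsampling the input image instead would enlarge every intermediate feature map in the network, which is where the extra memory and computation come from.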
End-to-end Training and Multi-task Loss:
- The proposal and detection sub-networks are combined in a single network, which is trained end-to-end with a multi-task loss.
- Intermediate layers produce object proposals, and the detection sub-network then classifies and refines them.
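
The summary does not spell out the loss, so the sketch below shows one common form of a multi-task detection loss, assumed here for illustration: a cross-entropy classification term plus a smooth L1 box-regression term that is applied only to non-background samples, summed over several detection branches with per-branch weights. The function names, the weight `lam`, and the branch weights are hypothetical and are not taken from the paper.

```python
import math


def smooth_l1(x):
    """Smooth L1 (Huber-like) penalty on a single regression residual."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def sample_loss(probs, label, box_pred, box_target, lam=1.0):
    """Classification loss plus, for non-background samples, box loss.

    probs: class probabilities (index 0 is background, by assumption here).
    label: ground-truth class index.
    """
    cls_loss = -math.log(probs[label])
    if label == 0:
        return cls_loss  # background samples carry no box target
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls_loss + lam * loc_loss


def multi_task_loss(branches, branch_weights):
    """Weighted sum of the per-sample losses of every detection branch."""
    total = 0.0
    for samples, alpha in zip(branches, branch_weights):
        total += alpha * sum(sample_loss(*s) for s in samples)
    return total


# Two hypothetical branches, each with one (probs, label, pred, target) sample.
branch_small = [([0.2, 0.8], 1, [0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])]
branch_large = [([0.9, 0.1], 0, [0.0] * 4, [0.0] * 4)]
print(multi_task_loss([branch_small, branch_large], [1.0, 1.0]))
```

Because every branch contributes to one scalar objective, a single backward pass updates the shared layers for both proposal generation and detection, which is what end-to-end training amounts to in practice.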

Performance and Results

The MS-CNN demonstrates state-of-the-art performance on the KITTI and Caltech benchmarks, both of which contain a substantial number of small objects. Some salient results from the paper include:

KITTI Dataset:
- For cars, the MS-CNN achieved an average precision (AP) of up to 89.02% in the moderate difficulty category, surpassing notable predecessors such as Faster-RCNN and 3DOP.
- For pedestrian detection, the MS-CNN reached an AP of 73.70%, highlighting its robustness, particularly for smaller and occluded objects.
- Cyclist detection also showed strong results, with an AP of 75.46%.
Caltech Pedestrian Dataset:
- The MS-CNN performed strongly across several evaluation settings, including reasonable, medium, and partial occlusion. It was particularly robust on small and occluded objects, outperforming specialized detectors such as DeepParts.

Implications

The proposed MS-CNN holds significant implications for both practical deployments and future research in object detection:

Practical Deployment:
- The reduced computational requirements (the paper reports up to 15 fps) combined with improved accuracy make MS-CNN a candidate for time-sensitive applications such as autonomous driving, surveillance, and robotics.
Future Research:
- The multi-scale detection framework and the feature upsampling strategies present avenues for further exploration, potentially integrating with other deep learning advancements like attention mechanisms or transformer models.
- Incorporating domain-specific knowledge or data augmentation techniques tailored to particular domains might further enhance its robustness and applicability.

Conclusion

The MS-CNN addresses the multi-scale object detection problem by attaching detectors to multiple output layers, each matched to a range of object scales, and by upsampling feature maps instead of input images. The empirical results on KITTI and Caltech support its effectiveness in improving detection accuracy while keeping processing fast. The design is a strong candidate for object detection systems in real-world settings where small objects are common.

