Weakly Supervised Deep Detection Networks (1511.02853v4)

Published 9 Nov 2015 in cs.CV

Abstract: Weakly supervised learning of object detection is an important problem in image understanding that still does not have a satisfactory solution. In this paper, we address this problem by exploiting the power of deep convolutional neural networks pre-trained on large-scale image-level classification tasks. We propose a weakly supervised deep detection architecture that modifies one such network to operate at the level of image regions, performing simultaneously region selection and classification. Trained as an image classifier, the architecture implicitly learns object detectors that are better than alternative weakly supervised detection systems on the PASCAL VOC data. The model, which is a simple and elegant end-to-end architecture, outperforms standard data augmentation and fine-tuning techniques for the task of image-level classification as well.

Summary

The paper introduces a two-stream CNN architecture that fuses classification and detection pathways to localize objects using only image-level labels.
It uses spatial pyramid pooling to extract region-level features from a pre-trained CNN and is trained end to end with stochastic gradient descent, so no bounding-box annotations are required.
Experiments show state-of-the-art weakly supervised detection results on the PASCAL VOC datasets, with the shallower architectures outperforming the deeper one.

Weakly Supervised Deep Detection Networks

The paper "Weakly Supervised Deep Detection Networks" by Hakan Bilen and Andrea Vedaldi addresses a long-standing challenge in image understanding: weakly supervised learning of object detection. The task is to learn object detectors from image-level labels alone, without access to bounding box annotations. The authors build on Convolutional Neural Networks (CNNs) pre-trained on large-scale image-level classification tasks and present a weakly supervised detection architecture called the Weakly Supervised Deep Detection Network (WSDDN).

Key Contributions

The main contributions of the paper can be summarized as follows:

Two-Stream Architecture: The proposed WSDDN modifies a pre-trained CNN to reason about image regions and splits it into two data streams, one for classification and one for detection. This lets the network perform region selection and classification simultaneously: the detection stream identifies salient regions while the classification stream recognizes object categories.
Training Methodology: The model is trained end-to-end using a combination of region proposals, back-propagation, and image-level supervision. The training uses stochastic gradient descent with momentum, fine-tuning the entire network on the target dataset.
Spatial Pyramid Pooling: The use of spatial pyramid pooling (SPP) is critical in adapting the pre-trained CNN for region-level tasks, enabling efficient feature extraction from image regions.
Novel Fusion of Classification and Detection Streams: The two streams apply softmax along different axes. The classification stream normalizes each region's scores over classes, while the detection stream normalizes each class's scores over regions. The resulting region scores are combined via an element-wise product and then summed over regions to predict image-level class labels.
State-of-the-Art Performance: The proposed architecture achieves state-of-the-art results on the PASCAL VOC 2007 and 2010 datasets, demonstrating superior performance to existing weakly supervised detection systems.
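
The fusion of the two streams can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (softmax, wsddn_scores, image_level_loss), the random inputs, and the shapes are assumptions made for illustration, and the per-class binary log loss is assumed here as one standard way to train on image-level labels.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def wsddn_scores(cls_logits, det_logits):
    # Both inputs have shape (R, C): R region proposals, C classes.
    cls = softmax(cls_logits, axis=1)   # classification stream: over classes, per region
    det = softmax(det_logits, axis=0)   # detection stream: over regions, per class
    region_scores = cls * det           # element-wise product of the two streams
    image_scores = region_scores.sum(axis=0)  # sum over regions -> one score per class
    return region_scores, image_scores

def image_level_loss(image_scores, labels, eps=1e-7):
    # Assumed loss: per-class binary log loss on image-level labels in {0, 1}.
    p = np.clip(image_scores, eps, 1.0 - eps)
    return float(-(labels * np.log(p) + (1 - labels) * np.log(1 - p)).sum())

# Hypothetical example: 5 region proposals, 3 classes, random stream outputs.
rng = np.random.default_rng(0)
R, C = 5, 3
region_scores, image_scores = wsddn_scores(rng.normal(size=(R, C)),
                                           rng.normal(size=(R, C)))
labels = np.array([1.0, 0.0, 1.0])      # image-level labels only, no boxes
loss = image_level_loss(image_scores, labels)
```

Because the detection stream sums to one over regions for each class and the classification stream stays below one, every image-level score lies strictly between 0 and 1, so it can be read as a per-class presence score. At test time, the region_scores matrix ranks proposals for each class.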

Experimental Results

The experiments cover detection accuracy, network architecture, region proposals, and image-level classification:

Detection Performance: The model's detection performance is evaluated using mean Average Precision (mAP) and CorLoc metrics on the PASCAL VOC datasets. The WSDDN achieves significantly better results than existing methods, with an ensemble of models yielding even higher performance.
Comparison of Architectures: Different CNN architectures (VGG-CNN-F, VGG-CNN-M-1024, and VGG-VD16) were evaluated, with findings suggesting that shallower architectures (VGG-CNN-F and VGG-CNN-M-1024) perform better than deeper networks like VGG-VD16 for this specific task.
Object Proposals: The paper includes a comparison of Selective Search Windows (SSW) and Edge Boxes (EB) for generating region proposals. Utilizing box scores from EB and incorporating spatial regularization further enhances detection accuracy.
Classification Performance: Although primarily designed for detection, WSDDN also shows improved image-level classification performance, outperforming standard fine-tuning techniques. This highlights the model's effectiveness in learning robust visual representations from weakly supervised data.
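
CorLoc is commonly computed as the fraction of images containing a class for which the top-scoring box overlaps a ground-truth box of that class with intersection-over-union of at least 0.5. The sketch below assumes that definition; the helper names (iou, corloc), the continuous box coordinates in (x1, y1, x2, y2) form, and the toy boxes are illustrative and not taken from the paper.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) in continuous coordinates (assumed convention).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(top_boxes, gt_boxes, thresh=0.5):
    # top_boxes[i]: highest-scoring box for the class in positive image i.
    # gt_boxes[i]: list of ground-truth boxes of that class in image i.
    hits = sum(
        any(iou(pred, gt) >= thresh for gt in gts)
        for pred, gts in zip(top_boxes, gt_boxes)
    )
    return hits / len(top_boxes)

# Toy data: the first prediction overlaps its ground truth, the second misses.
top_boxes = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt_boxes = [[(0, 0, 10, 8)], [(20, 20, 30, 30)]]
score = corloc(top_boxes, gt_boxes)  # one hit out of two images -> 0.5
```

Unlike mAP, which scores a ranked list of detections on the test set, this measure only asks whether the single best box per positive image localizes the object.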

Implications and Future Work

The work has several implications and open questions:

Cost-Effective Annotation: By reducing the need for detailed annotations, the methods presented can significantly lower the cost of training object detectors, making it feasible to scale up to large and complex datasets.
Generalization to Other Tasks: The architecture demonstrates that CNNs pre-trained for classification can be adapted for detection tasks with minimal additional supervision, suggesting potential applications in other areas like image segmentation and image captioning.
Further Exploration of Part Detection: One noted limitation is the network's tendency to detect object parts rather than whole objects. Future research could incorporate additional cues to address this and improve the holistic detection of objects.

Conclusion

The paper by Bilen and Vedaldi presents a compelling approach to weakly supervised object detection by modifying pre-trained CNNs to perform region-level reasoning. The proposed WSDDN architecture advances the state of the art in weakly supervised detection on the PASCAL VOC benchmarks. Beyond improving detection performance, the work shows how large-scale image-level annotations can be used efficiently, which remains an important direction for computer vision.
