- The paper introduces cross-convolutional-layer pooling to transform convolutional activations into effective regional descriptors with reduced computational demand.
- The paper demonstrates that utilizing convolutional layers yields robust image representations, achieving 73.5% accuracy on the fine-grained Birds-200 benchmark.
- The paper’s method eliminates extra encoding steps and challenges the reliance on fully-connected layers, offering efficient solutions for scene and object classification.
Cross-Convolutional Layer Pooling for Image Classification
The paper presents a novel method for image representation using deep convolutional neural networks (DCNNs), focusing on convolutional layer activations rather than the more commonly used fully-connected layers. The proposed technique, named cross-convolutional-layer pooling, exploits the spatial information preserved in convolutional layers to build robust image representations for visual classification at a lower computational cost than traditional methods.
Overview and Methodology
Traditional approaches to utilizing DCNNs for image classification typically emphasize features from the fully-connected layers, which are perceived as more discriminative. This research, however, demonstrates that convolutional layer activations hold untapped potential for powerful image representation when processed with appropriate techniques. The core innovation of the paper lies in transforming convolutional layer features into effective regional descriptors through cross-convolutional-layer pooling.
Rather than extracting a global image representation from the fully-connected layers, the proposed method treats subarrays of feature maps from one convolutional layer as local features and pools them under the guidance of the feature maps of the successive convolutional layer, whose channels act as spatial weightings. This design sharply reduces computational demand because far fewer DCNN forward computations are needed, a notable advantage over methods that run a separate forward computation for each local region. It also sidesteps the domain-mismatch problems that arise when fully-connected layer activations are used to describe local image regions.
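The pooling step can be sketched as follows. This is an illustrative NumPy reconstruction, not the paper's code: the array names are mine, and for simplicity the two layers are assumed to be spatially aligned, so each channel of the successive layer acts as a spatial weighting over the preceding layer's local features.

```python
import numpy as np

def cross_layer_pool(feat_t, feat_next):
    """Pool layer-t local features guided by layer-(t+1) channels.

    feat_t:    (D, H, W) activations of convolutional layer t
    feat_next: (K, H, W) activations of the successive layer,
               assumed here to be spatially aligned with feat_t.
    Returns a (K * D,) descriptor: for each of the K guidance
    channels, a weighted sum-pool of the D-dim local features.
    """
    # pooled[k, d] = sum over spatial positions of feat_t[d] * feat_next[k]
    pooled = np.einsum('dhw,khw->kd', feat_t, feat_next)
    return pooled.reshape(-1)

# Toy activations standing in for real DCNN feature maps.
rng = np.random.default_rng(0)
feat_t = rng.random((256, 13, 13))     # e.g. a conv4-like layer
feat_next = rng.random((256, 13, 13))  # e.g. a conv5-like layer
desc = cross_layer_pool(feat_t, feat_next)
print(desc.shape)  # (65536,)
```

Because the guidance weights come from activations already computed in the same forward pass, the whole descriptor falls out of a single pass through the network.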
Cross-convolutional-layer pooling simply concatenates the channel-specific pooling results, with no additional dictionary learning or encoding steps. This simplicity yields a computationally efficient alternative, and extensive experiments on datasets including MIT-67 and Birds-200 demonstrate performance comparable or superior to fully-connected-layer methodologies, particularly for fine-grained and scene classification tasks.
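Since no encoding step follows the concatenation, the only remaining post-processing is a normalization of the descriptor before it is handed to a linear classifier. A minimal sketch is below; the signed square-root followed by L2 normalization is a common practice for such high-dimensional pooled descriptors, and treating it as the paper's exact choice is my assumption.

```python
import numpy as np

def normalize_descriptor(desc):
    """Signed square-root followed by L2 normalization.

    A common post-processing for high-dimensional pooled
    descriptors; no dictionary learning or encoding is needed.
    """
    desc = np.sign(desc) * np.sqrt(np.abs(desc))
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

# A toy concatenation of channel-specific pooled vectors.
pooled_per_channel = [np.array([4.0, -1.0]), np.array([0.0, 9.0])]
image_descriptor = normalize_descriptor(np.concatenate(pooled_per_channel))
print(round(np.linalg.norm(image_descriptor), 6))  # 1.0
```

The signed square-root damps the influence of a few very large pooled values, and the L2 step makes descriptors comparable across images for a linear classifier.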
Numerical Results
The technique was rigorously tested across four datasets covering different visual classification tasks: scene classification (MIT-67), fine-grained object classification (Birds-200), generic object classification (Pascal VOC 2007), and attribute classification (H3D Human Attributes dataset). The results underline the efficacy of cross-layer pooling, notably its balance of performance and computational efficiency. For example, on the Birds-200 dataset the proposed method achieved 73.5% accuracy, outperforming many existing approaches that require greater computational resources.
Implications and Future Directions
This method's implications are significant both practically and theoretically. Practically, it offers a computationally efficient way to leverage pretrained DCNNs, which could facilitate broader adoption in resource-constrained environments. Theoretically, it challenges the conventional belief that fully-connected layers are inherently superior for generating image representations, highlighting instead that convolutional layers, if appropriately processed, can offer enhanced discriminative power.
Future developments could explore the integration of cross-convolutional-layer pooling with other neural network architectures. Additionally, extending the methodology to support unsupervised or semi-supervised tasks could enable its application in scenarios with limited labeled data. The feature extraction and computational efficiency advancements presented might influence future AI developments beyond image classification, potentially inspiring innovations in related fields such as object detection and semantic segmentation.
In conclusion, this research substantiates the potential of convolutional layers in DCNNs through cross-layer pooling, offering a promising direction for efficient and effective image representation. The impressive balance of accuracy and speed marks a notable contribution to the domain of visual recognition, with ongoing implications for the design and application of neural network models.