- The paper shows that off-the-shelf CNN features serve as a robust baseline across diverse recognition tasks.
- The methodology pairs features from a pretrained OverFeat network with linear SVMs and simple data augmentation, matching or surpassing highly tuned task-specific pipelines.
- Empirical results across multiple datasets validate that transferring CNN representations simplifies model development while achieving high accuracy.
Overview of "CNN Features off-the-shelf: an Astounding Baseline for Recognition"
The paper "CNN Features off-the-shelf: an Astounding Baseline for Recognition" by Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson investigates the use of convolutional neural networks (CNNs) as feature extractors for a variety of visual recognition tasks. Conducted at the Royal Institute of Technology (KTH) in Stockholm, Sweden, the paper utilizes the OverFeat network trained on ImageNet ILSVRC 2013. The authors demonstrate that CNN features, even when applied off-the-shelf without task-specific fine-tuning, achieve impressive performance across multiple recognition tasks, ranging from object and scene classification to attribute detection and image retrieval.
Main Findings
The primary inquiry of the paper focuses on whether generic descriptors extracted from CNNs can serve as powerful feature representations for a variety of visual recognition tasks, without additional fine-tuning. The authors conducted multiple experiments across different datasets and tasks:
- Image Classification:
  - Pascal VOC 2007: For object image classification on Pascal VOC 2007, the off-the-shelf OverFeat representation combined with a linear SVM outperformed state-of-the-art methods that likewise used training data from outside Pascal VOC.
  - MIT-67: For indoor scene classification on the MIT-67 dataset, the CNN representation achieved superior results, significantly outperforming methods built on traditional hand-crafted features.
- Fine-Grained Recognition:
  - CUB 200-2011 and Oxford 102 Flowers: For fine-grained categories such as bird species and flower types, the CNN features proved highly effective, using the provided bounding box annotations for the birds; adding simple data augmentation (cropped and rotated training samples) improved the results further.
- Attribute Detection:
  - UIUC 64 and H3D datasets: The CNN features surpassed the state of the art in object attribute detection, covering shape, part-based, and material attributes. For human attribute detection on H3D, they outperformed poselet- and deformable-part-model-based approaches.
- Visual Instance Retrieval:
  - Multiple Datasets: The CNN-based representation was tested on several challenging image retrieval datasets: Oxford5k, Paris6k, Sculptures6k, Holidays, and UKBench. The descriptors were post-processed with PCA, whitening, and L2 normalization (a sketch of this encoding follows the list below), and consistently outperformed other low-memory-footprint methods such as BoW and VLAD.
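As a rough illustration of the retrieval encoding mentioned in the last item, the sketch below L2-normalizes raw descriptors, fits PCA whitening on an auxiliary image set, and re-normalizes so that plain dot products act as cosine similarity; the output dimensionality (256 here) is an assumption, not the paper's setting.

```python
# Hedged sketch of the retrieval post-processing: L2 normalization,
# PCA whitening fitted on held-out descriptors, then re-normalization.
import numpy as np
from sklearn.decomposition import PCA

def l2n(X):
    """Row-wise L2 normalization of a (N, d) descriptor matrix."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def fit_whitener(aux_feats, n_components=256):
    """Fit PCA whitening on descriptors from an auxiliary image set
    (n_components=256 is an illustrative choice)."""
    return PCA(n_components=n_components, whiten=True).fit(l2n(aux_feats))

def encode(feats, pca):
    """Whiten and re-normalize; database images are then ranked by
    dot product against the encoded query."""
    return l2n(pca.transform(l2n(feats)))

# scores = encode(query_feats, pca) @ encode(db_feats, pca).T
# ranking = np.argsort(-scores, axis=1)
```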
Numerical Results and Key Observations
- For Pascal VOC 2007, the augmented CNN representation (CNNaug-SVM) achieved a mean Average Precision (mAP) of 77.2, outperforming other methods that also used training data beyond VOC.
- On MIT-67, the augmented CNN-SVM reached a classification accuracy of 69.0, outperforming most competing methods, including considerably more sophisticated pipelines.
- In fine-grained recognition, the augmented CNN-SVM achieved an accuracy of 61.8 on CUB 200-2011 using bounding box annotations, and 86.8 on Oxford 102 Flowers without any segmentation, outperforming multiple-kernel methods.
- For attribute detection, the CNN-SVM achieved an average accuracy of 91.5 on the UIUC 64 dataset and 73.0 mAP on the H3D dataset.
- In image retrieval, the spatial search CNN representation achieved a mAP of 68.0 on Oxford5k and 79.5 on Paris6k, demonstrating its competitiveness with traditional retrieval pipelines (the window-matching idea is sketched below).
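The spatial search behind those retrieval numbers compares images through their best-matching sub-windows rather than a single global descriptor. The sketch below assumes an illustrative window layout (the whole image plus a 2x2 grid of crops); the paper's actual multi-scale schedule differs. Window descriptors can come from an extractor like the `extract` helper sketched earlier.

```python
# Illustrative sketch of spatial search: describe each image by several
# sub-window descriptors and score a pair by its best window match.
# ASSUMPTION: the window layout (full image + 2x2 grid) is ours, not
# the paper's exact schedule.
import numpy as np

def windows(image):
    """PIL image -> list of crops: the whole image plus a 2x2 grid."""
    w, h = image.size
    crops = [image]
    for x in (0, w // 2):
        for y in (0, h // 2):
            crops.append(image.crop((x, y, x + w // 2, y + h // 2)))
    return crops

def spatial_distance(query_desc, db_desc):
    """query_desc, db_desc: (k, d) L2-normalized window descriptors.
    Distance is 1 minus the best cosine similarity over window pairs."""
    return 1.0 - (query_desc @ db_desc.T).max()
```

Taking the minimum over window pairs makes the match robust to the object occupying only part of either image, at the cost of extracting several descriptors per image.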
Theoretical and Practical Implications
The empirical results strongly suggest that CNN-derived representations are robust and broadly applicable across vision tasks. From a theoretical perspective, the findings underscore the power of the transfer learning paradigm: features learned on a large dataset can be reused effectively for quite different tasks. The consistently high performance across diverse tasks and datasets supports the premise that deep convolutional features serve as a strong baseline.
From a practical standpoint, adopting CNN features significantly lowers the cost of developing custom models, allowing researchers to capitalize on pretrained networks to reach high performance. This can democratize access to advanced computer vision capabilities, especially when the computational resources needed to retrain large networks are unavailable.
Future Directions
Future work could explore fine-tuning the CNN representations for specific tasks to realize further performance gains. Moreover, integrating CNN features with spatial re-ranking, query expansion, and geometric constraints holds promise for improving retrieval robustness. Optimizing these approaches for real-time applications across diverse domains could also be an impactful area of exploration.
In conclusion, this paper marks a significant step in understanding the transferability and utility of CNN features across a broad spectrum of visual recognition tasks, demonstrating that off-the-shelf deep representations can be deployed effectively.