- The paper demonstrates that features extracted from deep convolutional networks, such as DeCAF, significantly outperform traditional representations in object, scene, and domain adaptation tasks.
- The study frames transfer as semi-supervised multi-task learning: a network trained on ImageNet supplies fixed activation features that generalize robustly to diverse visual tasks.
- Numerical results on benchmarks like Caltech-101 (86.9%), Caltech-UCSD Birds (64.96%), and SUN-397 (40.94%) highlight the practical impact of DeCAF.
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
The paper "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition" explores the reusability of deep convolutional network features, specifically those trained on large-scale object recognition tasks, for a diverse array of visual recognition challenges. These challenges include scene recognition, domain adaptation, and fine-grained recognition tasks which differ substantially from the tasks originally used to train the network. The authors also release DeCAF, an open-source implementation of these features.
Introduction
The goal of this research is to assess the generalizability of features extracted from deep convolutional networks. Conventional visual representations have reached a performance plateau, whereas layered compositional architectures such as deep convolutional networks hold promise for capturing semantically meaningful structure. Networks trained on extensive datasets like ImageNet have already demonstrated success in large-scale visual tasks; such deep models, however, tend to overfit when trained directly on tasks with limited data. This paper therefore investigates a semi-supervised, multi-task learning approach: reuse a model pre-trained on a large auxiliary dataset as a feature extractor for new tasks with insufficient training data.
Main Findings
The principal findings of this paper validate that features extracted via convolutional networks trained on ImageNet significantly outperform traditional features on multiple vision benchmarks. The DeCAF features, defined as activations from the network's upper layers, were evaluated across varied tasks and consistently demonstrated superior performance.
Methodology
The authors used a deep convolutional neural network trained on the ImageNet Large Scale Visual Recognition Challenge 2012 dataset. Activations from the network's upper layers, chiefly DeCAF6 and DeCAF7 (the first and second fully connected layers), were extracted and assessed for their efficacy in object recognition on Caltech-101, domain adaptation with the Office dataset, fine-grained recognition on the Caltech-UCSD Birds dataset, and scene recognition on SUN-397.
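To make the extraction step concrete, below is a minimal Python sketch of pulling DeCAF-style activations from a pretrained ImageNet network. It uses torchvision's AlexNet purely as a modern stand-in for the network in the paper; the layer indices, preprocessing constants, and the decaf_features helper are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of DeCAF-style feature extraction. torchvision's pretrained
# AlexNet stands in for the paper's ImageNet-trained network; layer indices
# and preprocessing values are assumptions for illustration only.
import torch
from PIL import Image
from torchvision import models, transforms as T

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def decaf_features(image_path: str):
    """Return activations roughly analogous to DeCAF6 and DeCAF7
    (the first and second fully connected layers)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        conv = model.avgpool(model.features(x)).flatten(1)  # conv-stack output
        fc6 = model.classifier[1](conv)                      # ~DeCAF6
        fc7 = model.classifier[4](torch.relu(fc6))           # ~DeCAF7
    return fc6.squeeze(0), fc7.squeeze(0)
```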
Feature Generalization and Visualization
Deep features such as DeCAF6 and DeCAF7 showed notable clustering by semantic topic, indicating better generalization to unseen classes. For instance, t-SNE visualizations of features from the ILSVRC-2012 validation set showed stronger semantic clustering for the higher network layers than for conventional features such as GIST and LLC.
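A rough sketch of how such a visualization can be produced: run t-SNE on a matrix of extracted feature vectors and color each point by class label to inspect clustering. The plot_tsne helper, its inputs, and the plotting choices are assumptions (e.g., features produced by the extraction sketch above), not the paper's figure code.

```python
# Hedged sketch of the t-SNE visualization: embed N x D feature vectors in 2-D
# and color points by class label to inspect semantic clustering.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, name: str = "DeCAF6") -> None:
    # 2-D embedding of the high-dimensional activations
    embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab20")
    plt.title(f"t-SNE of {name} features")
    plt.axis("off")
    plt.show()
```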
Numerical Results
- Object Recognition (Caltech-101): Training a linear SVM on DeCAF6 with dropout resulted in 86.9% accuracy, outperforming traditional multi-kernel learning methods (the classifier-on-features protocol is sketched after this list).
- Domain Adaptation (Office Dataset): DeCAF features substantially improved cross-domain recognition; in feature space, images cluster by category rather than by domain, indicating that the shift between the Amazon and Webcam domains is largely removed.
- Fine-Grained Recognition (Caltech-UCSD Birds): Using DeCAF6 within the Deformable Part Descriptor (DPD) framework achieved state-of-the-art accuracy of 64.96%, significantly surpassing prior methods.
- Scene Recognition (SUN-397): A logistic regression classifier trained on DeCAF7 attained 40.94% accuracy, exceeding the previous state of the art.
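All four benchmarks above follow the same simple protocol: freeze the pre-trained network, extract features, and fit a shallow linear classifier. The sketch below illustrates that protocol with scikit-learn; the helper name, the feature-scaling step, and the placeholder data shapes are assumptions, not the paper's exact training setup.

```python
# Minimal sketch of the evaluation protocol: fit a linear SVM or logistic
# regression on fixed DeCAF-style features and report held-out accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def evaluate_linear_classifier(X_train, y_train, X_test, y_test, use_svm=True):
    clf = LinearSVC(C=1.0) if use_svm else LogisticRegression(max_iter=1000)
    model = make_pipeline(StandardScaler(), clf)  # scaling is an assumption
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)            # mean accuracy on held-out images

# Usage with random placeholder data, just to show the call shape:
# acc = evaluate_linear_classifier(np.random.randn(100, 4096),
#                                  np.random.randint(0, 5, size=100),
#                                  np.random.randn(20, 4096),
#                                  np.random.randint(0, 5, size=20))
```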
Implications
The DeCAF results carry implications for both practice and theory in computer vision. Practically, they support the shift toward reusing deep features trained on large auxiliary datasets across a variety of tasks. Theoretically, they underscore the capacity of deep learning architectures to capture highly general semantic features.
Speculation on Future Developments
Given the empirical success of DeCAF, future developments are likely to focus on:
- Extension to Other Modalities: Exploring the applicability of similar deep feature learning methodologies to other data modalities such as text, audio, and multimodal inputs.
- Efficient Training Mechanisms: Developing scalable training mechanisms that further reduce the computational overhead of deep models.
- Broader Applicability: Leveraging these features for even more diverse applications such as retrieval, anomaly detection, and unsupervised learning tasks.
Conclusion
This paper provides a thorough empirical validation of using deep convolutional activation features for various visual recognition tasks, illustrating their superior performance over traditional hand-engineered features. The open-source implementation of DeCAF is a significant contribution that enables the vision research community to further explore and validate the versatility and effectiveness of deep feature representations across a wide range of visual learning paradigms.