
A Taxonomy of Deep Convolutional Neural Nets for Computer Vision (1601.06615v1)

Published 25 Jan 2016 in cs.CV, cs.LG, and cs.MM

Abstract: Traditional architectures for solving computer vision problems and the degree of success they enjoyed have been heavily reliant on hand-crafted features. However, of late, deep learning techniques have offered a compelling alternative -- that of automatically learning problem-specific features. With this new paradigm, every problem in computer vision is now being re-examined from a deep learning perspective. Therefore, it has become important to understand what kind of deep networks are suitable for a given problem. Although general surveys of this fast-moving paradigm (i.e. deep-networks) exist, a survey specific to computer vision is missing. We specifically consider one form of deep networks widely used in computer vision - convolutional neural networks (CNNs). We start with "AlexNet" as our base CNN and then examine the broad variations proposed over time to suit different applications. We hope that our recipe-style survey will serve as a guide, particularly for novice practitioners intending to use deep-learning techniques for computer vision.

Citations (211)

Summary

  • The paper introduces a comprehensive taxonomy categorizing CNN architectures, emphasizing key building blocks and performance improvements.
  • It details various architectural modifications and training techniques that outperform traditional computer vision methods.
  • The study outlines open challenges like hyperparameter tuning and adversarial robustness, guiding future innovations in CNN design.

An Overview of "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision"

This paper presents a comprehensive survey specifically targeting the use of Convolutional Neural Networks (CNNs) within the domain of computer vision. The authors, Suraj Srinivas, Ravi Kiran Sarvadevabhatla, and Konda Reddy Mopuri, aim to bridge a gap in the existing literature, which often surveys deep learning in general without focusing on computer vision applications. The paper examines architectural modifications and enhancements of CNNs for a wide variety of vision tasks, both assessing past accomplishments and pointing toward future work.

Traditional Versus Deep Learning Methods

Historically, computer vision tasks relied heavily on hand-engineered features such as SIFT and HOG, followed by machine learning classifiers such as SVMs. These approaches faced limitations in scalability and in adapting to diverse image-classification tasks. The emergence of deep learning, with its capability to automatically learn features through hierarchical abstractions from raw data, constituted a paradigm shift. On large datasets such as ImageNet, CNNs significantly outperformed these traditional methods.
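
For contrast, here is a minimal sketch (not code from the paper) of such a traditional pipeline, assuming scikit-image and scikit-learn are available: hand-crafted HOG descriptors are computed by fixed rules and fed to a linear SVM, with no feature learning anywhere. The images and labels below are dummy placeholders.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(images):
    # Hand-crafted HOG descriptors: fixed by design, not learned from data.
    return np.array([hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                     for img in images])

# Dummy grayscale images and binary labels stand in for a real dataset.
train_images = np.random.rand(20, 64, 64)
train_labels = np.random.randint(0, 2, size=20)

features = extract_hog(train_images)
clf = LinearSVC().fit(features, train_labels)   # shallow classifier on fixed features
print(clf.predict(extract_hog(np.random.rand(2, 64, 64))))
```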

CNN Architecture Overview

The authors provide a detailed breakdown of CNN architectures using "AlexNet" as a foundational model. They discuss:

  • Building Blocks: convolutional filters, local connectivity, weight sharing, and nonlinearities such as the ReLU.
  • Depth of Networks: the effectiveness of deeper architectures, showing how models with more layers, such as VGG and GoogLeNet, surpass shallower ones.
  • Learning Algorithms: gradient descent variants, including mini-batch SGD, momentum, and adaptive learning rates, which make training efficient.
  • Regularization Techniques: dropout as an effective method to mitigate overfitting in large networks. A minimal training sketch combining these elements follows this list.
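
To make these ingredients concrete, the following is a minimal sketch, assuming PyTorch, of an AlexNet-style classifier trained for one step with mini-batch SGD and momentum; the layer sizes, class count, and dummy data are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SmallAlexNetStyle(nn.Module):
    """Illustrative AlexNet-style CNN: stacked convolutions (weight sharing,
    local connectivity), ReLU nonlinearities, max pooling, and a
    dropout-regularized fully connected classifier."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2),   # learned convolutional filters
            nn.ReLU(inplace=True),                         # nonlinearity
            nn.MaxPool2d(2),                               # spatial downsampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),                             # regularization against overfitting
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Mini-batch SGD with momentum, as discussed under learning algorithms.
model = SmallAlexNetStyle()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)           # dummy mini-batch of 32x32 RGB images
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Note that dropout is active only in training mode (model.train()) and is disabled automatically under model.eval(); swapping the optimizer line for an adaptive learning-rate method leaves the rest of the loop unchanged.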

Variants and Extensions of CNNs

Recognizing the broad applicability of CNNs, this survey categorizes multiple CNN derivatives tailored for specific tasks beyond general image classification:

  • Region-Based CNNs: Extend CNNs for object detection and localization, overcoming challenges associated with variable-sized objects in images.
  • Fully Convolutional Networks: Adapt CNNs for per-pixel prediction tasks, advancing semantic segmentation and scene parsing (a minimal sketch follows this list).
  • Multi-Modal Networks: Integrate complementary data from various modalities, enhancing capabilities in tasks such as RGB-D scene understanding and video classification.
  • CNNs Integrated with RNNs: Combine RNNs with CNNs to handle sequential data, which has proven impactful in video analysis and visual question answering.
  • Hybrid Learning Methods: Leverage CNNs in multi-task learning and metric learning, providing robust solutions for challenges like object recognition and image retrieval.
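
To illustrate the fully convolutional idea referenced above, the sketch below (assuming PyTorch; the tiny backbone and class count are illustrative, not taken from the paper) replaces fully connected layers with a 1x1 convolutional classifier and upsamples the coarse score map back to the input resolution, yielding a prediction for every pixel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Illustrative fully convolutional network: every layer is convolutional,
    so the output is a class-score map rather than a single image-level label."""
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 1/2 resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 1/4 resolution
        )
        # A 1x1 convolution acts as a per-pixel classifier head.
        self.score = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.backbone(x)
        scores = self.score(feats)
        # Upsample coarse score maps back to the input resolution.
        return F.interpolate(scores, size=(h, w), mode="bilinear", align_corners=False)

model = TinyFCN()
out = model(torch.randn(1, 3, 64, 64))
print(out.shape)   # torch.Size([1, 21, 64, 64]) -- one class distribution per pixel
```

Because every operation is convolutional, the same network can be applied to inputs of arbitrary spatial size.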

Implications and Future Directions

The insights provided by the authors on CNN adaptations for a myriad of computer vision tasks underscore the practical advancements and theoretical understanding required for future progress. The paper articulates open problems, such as hyperparameter optimization, adversarial robustness, and unsupervised learning, which continue to challenge researchers. It speculates on how improvements in these areas could broaden the applicability of CNNs in novel contexts, particularly in real-time and autonomous systems.

By framing current research within an extensive taxonomy of CNN configurations and modifications, the authors furnish vision researchers with a crucial reference guide, equipping both novice and seasoned practitioners with knowledge to further harness deep learning's potential in computer vision. The detailed survey not only reflects on existing achievements but also actively encourages innovation and deeper exploration of unsolved challenges in the discipline.