Greedy Layerwise Learning Can Scale to ImageNet (1812.11446v3)

Published 29 Dec 2018 in cs.LG and stat.ML

Abstract: Shallow supervised 1-hidden layer neural networks have a number of favorable properties that make them easier to interpret, analyze, and optimize than their deep counterparts, but lack their representational power. Here we use 1-hidden layer learning problems to sequentially build deep networks layer by layer, which can inherit properties from shallow networks. Contrary to previous approaches using shallow networks, we focus on problems where deep learning is reported as critical for success. We thus study CNNs on image classification tasks using the large-scale ImageNet dataset and the CIFAR-10 dataset. Using a simple set of ideas for architecture and training we find that solving sequential 1-hidden-layer auxiliary problems leads to a CNN that exceeds AlexNet performance on ImageNet. Extending this training methodology to construct individual layers by solving 2- and 3-hidden-layer auxiliary problems, we obtain an 11-layer network that exceeds several members of the VGG model family on ImageNet, and can train a VGG-11 model to the same accuracy as end-to-end learning. To our knowledge, this is the first competitive alternative to end-to-end training of CNNs that can scale to ImageNet. We illustrate several interesting properties of these models theoretically and conduct a range of experiments to study the properties this training induces on the intermediate layers.

Citations (174)

Summary

  • The paper presents a sequential training method where each CNN layer is optimized independently through auxiliary shallow tasks, challenging traditional end-to-end training.
  • Experimental results show that the approach exceeds AlexNet and rivals several VGG models on ImageNet, with strong results on CIFAR-10 as well, confirming its competitive accuracy.
  • The method reduces memory overhead by avoiding end-to-end backpropagation through the full network, making it suitable for memory-constrained and large-scale applications.

Greedy Layerwise Learning for Scalable Convolutional Neural Networks

The paper "Greedy Layerwise Learning Can Scale to ImageNet," authored by Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon, presents an alternative methodology to end-to-end training for deep Convolutional Neural Networks (CNNs), demonstrating that such an approach can be effectively scaled to ImageNet, a large-scale image classification dataset. The authors seek to challenge the prevailing assumption that high-performance CNNs necessitate jointly learned layers, advocating instead for a sequential, layerwise training paradigm.

Overview of Methodology

The central contribution of this work is the demonstration that the layers of a CNN can be trained one at a time by solving a sequence of shallow auxiliary learning problems. Each new layer is trained, together with a small auxiliary classifier, on the frozen outputs of the previously trained layers; once trained, the auxiliary head is discarded and the layer's outputs become the inputs for the next stage. The technique relies on 1-hidden, 2-hidden, and 3-hidden layer auxiliary problems, progressively extending the depth of the CNN while improving its classification accuracy, as sketched in the code below.
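
The following is a minimal PyTorch sketch of such a greedy layerwise loop under simplifying assumptions: the block and auxiliary-head architectures, channel widths, and optimizer settings are illustrative placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def make_block(in_ch, out_ch):
    # One convolutional stage to be trained in isolation (illustrative design).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

def make_aux_head(out_ch, num_classes):
    # Shallow auxiliary classifier: global pooling followed by a linear layer.
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(out_ch, num_classes),
    )

def train_layerwise(loader, channels=(3, 64, 128, 256), num_classes=10,
                    epochs_per_layer=1, device="cpu"):
    criterion = nn.CrossEntropyLoss()
    trained = []                                   # frozen, already-trained blocks
    for in_ch, out_ch in zip(channels[:-1], channels[1:]):
        block = make_block(in_ch, out_ch).to(device)
        head = make_aux_head(out_ch, num_classes).to(device)
        opt = torch.optim.SGD(list(block.parameters()) + list(head.parameters()),
                              lr=0.1, momentum=0.9)
        for _ in range(epochs_per_layer):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                with torch.no_grad():              # earlier layers are fixed
                    for b in trained:
                        x = b(x)
                loss = criterion(head(block(x)), y)
                opt.zero_grad()
                loss.backward()                    # gradients stay within the current block
                opt.step()
        for p in block.parameters():               # freeze before moving on
            p.requires_grad_(False)
        trained.append(block)
    return nn.Sequential(*trained)                 # auxiliary heads are discarded
```

Because gradients never flow across block boundaries, each auxiliary problem is a shallow optimization in its own right; the deeper network emerges only from stacking the solutions.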

Empirical Evaluation

The paper reports extensive experimental results on two major datasets: ImageNet and CIFAR-10. The sequentially trained CNNs exceed AlexNet's performance on ImageNet and rival that of several VGG architectures. On CIFAR-10, models trained with the proposed greedy layerwise method outperform alternatives based on unsupervised learning and handcrafted descriptors. Notably, the 3-hidden-layer variants reach accuracy competitive with VGG models and exhibit similar transfer-learning behavior, supporting the broad applicability of the approach.

Theoretical Implications

From a theoretical standpoint, the layerwise training paradigm is grounded in well-established results on shallow networks. While deep networks involve complex interactions across layers that complicate standard theoretical analysis, the authors leverage existing results for 1-hidden-layer networks, arguing that greedy layerwise training can cascade these guarantees to deeper architectures. This perspective also highlights progressive linear separability: the authors observe empirically that intermediate representations become increasingly linearly separable from layer to layer, a property that can be checked with simple linear probes, as sketched below.
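
A simple way to probe this property, assuming access to the frozen prefixes of a trained network and standard scikit-learn tooling, is to fit a linear classifier on the features produced at each depth; rising probe accuracy with depth indicates increasingly linearly separable representations. The helper names below are hypothetical.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(prefix, loader, device="cpu"):
    # Run a frozen network prefix over a loader and collect flattened features.
    prefix.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        z = prefix(x.to(device))
        feats.append(z.flatten(1).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def probe_accuracies(prefixes, train_loader, test_loader):
    # One linear probe per depth; higher accuracy suggests more linearly separable features.
    accs = []
    for prefix in prefixes:
        Xtr, ytr = extract_features(prefix, train_loader)
        Xte, yte = extract_features(prefix, test_loader)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        accs.append(clf.score(Xte, yte))
    return accs
```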

Computational Efficiency and Practical Implications

The proposed training methodology offers tangible computational benefits. Because each layer is optimized in isolation, the activations of earlier layers need not be retained for backpropagation, which substantially reduces memory requirements and makes the approach suitable for computationally constrained environments. The strategy also permits training larger models in settings where end-to-end backpropagation is infeasible due to hardware limitations, as the memory sketch below illustrates.
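
The sketch below illustrates the memory argument under assumed shapes and an available CUDA device: when the frozen prefix runs without autograd, its activations are not stored for a backward pass, so peak activation memory is governed mainly by the single block being trained. The toy network and tensor sizes are illustrative only.

```python
import torch
import torch.nn as nn

# Toy 8-block network and a batch of feature maps (illustrative sizes).
deep = nn.Sequential(*[nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
                       for _ in range(8)]).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

def peak_memory_mb(step):
    deep.zero_grad(set_to_none=True)               # drop stale gradients
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    step()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

def end_to_end_step():
    # Every intermediate activation is retained for the full backward pass.
    deep(x).sum().backward()

def layerwise_step():
    # The frozen prefix runs without autograd, so its activations are not stored;
    # only the block currently being trained keeps activations for backward.
    with torch.no_grad():
        z = deep[:-1](x)
    deep[-1](z).sum().backward()

print(f"end-to-end peak: {peak_memory_mb(end_to_end_step):.0f} MiB")
print(f"layerwise peak:  {peak_memory_mb(layerwise_step):.0f} MiB")
```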

Future Directions

The paper opens several avenues for future research. Improving the efficiency of layerwise training, for example by parallelizing the optimization of successive stages, could enable faster training without compromising performance. Combining the methodology with architectural innovations such as residual connections may yield deeper insights and improved results.

In conclusion, this investigation not only challenges entrenched assumptions about deep CNN training paradigms but also provides a strategic alternative that could bear significance for both theoretical explorations in neural network design and practical implementations across diverse AI applications. The demonstrated scalability to ImageNet underscores its potential for broader applications and utility in advancing CNN capabilities beyond traditional learning frameworks.