
Deep Fried Convnets (1412.7149v4)

Published 22 Dec 2014 in cs.LG, cs.NE, and stat.ML

Abstract: The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.

Citations (263)

Summary

  • The paper introduces the Adaptive Fastfood transform to reduce storage from O(nd) to O(n) and computation from O(nd) to O(n log d) without sacrificing accuracy.
  • The methodology leverages structured random projections and kernel approximations inspired by Bochner’s theorem and the Fast Hadamard Transform.
  • Empirical results on MNIST and ImageNet confirm that deep fried convnets maintain or improve performance despite a significant reduction in model parameters.

Overview of Deep Fried Convnets

The paper "Deep Fried Convnets" introduces an approach for reducing the number of parameters in the fully connected layers of convolutional neural networks (CNNs) without compromising predictive performance. The key tool is the Adaptive Fastfood transform, which improves both the storage and computational efficiency of the network, a property that is critical for deploying large models on distributed systems and embedded devices.

Key Contributions

The Adaptive Fastfood transform reparameterizes the matrix-vector multiplication of fully connected layers in CNNs. A standard fully connected layer with d inputs and n outputs traditionally incurs O(nd) storage and computational costs. With the Adaptive Fastfood transform, these costs are reduced to O(n) and O(n log d), respectively. This enables the creation of what the authors term "deep fried convnets," models that are trainable in an end-to-end manner and support a significant reduction in model size.
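To make the reduction concrete: the Fastfood factorization replaces a dense weight matrix with the structured product V = S H G Π H B, where S, G, and B are diagonal matrices, Π is a fixed permutation, and H is the Hadamard matrix, which is never stored explicitly. The layer size below is our illustrative example, not a figure from the paper; for a square layer with d = n = 4096, the trainable parameter counts compare as follows:

```latex
\underbrace{nd}_{\text{dense weights}} = 4096 \times 4096 \approx 1.7 \times 10^{7}
\qquad \text{vs.} \qquad
\underbrace{3d}_{\text{diagonals } S,\, G,\, B} = 3 \times 4096 = 12{,}288
```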

In particular, the paper demonstrates that deep fried convnets maintain prediction accuracy on datasets like MNIST and ImageNet while reducing parameters. The core contribution lies in using structured random projections and kernel approximations to replace the dense structures conventionally found in fully connected layers.
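A minimal NumPy sketch of one Fastfood block may help make this structure concrete. It follows the factorization V x = S H G Π H B x described above; the function names and the omission of the 1/(σ√d) normalization (which the adaptive variant can absorb into the learned diagonal S) are our simplifications, not the authors' code:

```python
import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform of a length-2^k vector.
    Runs in O(d log d) time, so the Hadamard matrix H is never built."""
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def fastfood_block(x, S, G, B, perm):
    """One Fastfood block, y = S H G Pi H B x.
    S, G, B are length-d vectors holding the diagonals (trainable in the
    adaptive variant); perm encodes the fixed permutation Pi.
    Storage is O(d); time is O(d log d)."""
    v = fwht(B * x)        # H B x  (B * x allocates a fresh array)
    v = fwht(G * v[perm])  # H G Pi H B x
    return S * v           # S H G Pi H B x

# Toy usage: d must be a power of two.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
S, G = rng.normal(size=d), rng.normal(size=d)
B = rng.choice([-1.0, 1.0], size=d)   # random signs, as in Fastfood
perm = rng.permutation(d)
y = fastfood_block(x, S, G, B, perm)
```

To produce n > d outputs, several such blocks are stacked, and in the adaptive variant the diagonals S, G, and B are updated by backpropagation along with the convolutional layers.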

Theoretical and Numerical Insights

Theoretical insights are grounded in structured random projection frameworks and kernel feature approximations. The paper draws on Bochner's theorem, which connects shift-invariant kernels to random Fourier features, and computes its structured projection with the Fast Hadamard Transform, a divide-and-conquer algorithm in the style of the Fast Fourier Transform. Practically, this improves both training and inference efficiency, especially in memory-constrained environments.
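For reference, Bochner's theorem states that a continuous shift-invariant kernel is the Fourier transform of a non-negative measure, which yields the random Fourier feature approximation that Fastfood evaluates quickly (this is the standard formulation; the notation here is ours, not taken from the paper):

```latex
k(x - y) = \int_{\mathbb{R}^d} p(w)\, e^{i w^\top (x - y)}\, dw
\;\approx\; \frac{1}{m} \sum_{j=1}^{m} \phi_{w_j}(x)\, \overline{\phi_{w_j}(y)},
\qquad \phi_w(x) = e^{i w^\top x},\quad w_j \sim p(w)
```

Fastfood's role is to draw the m projection directions w_j implicitly through the structured product S H G Π H B rather than storing a dense Gaussian matrix.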

Empirically, the paper reports substantial parameter reductions without degradation in accuracy. On MNIST, the deep fried convnet matches or improves on the error rate of its denser counterpart with significantly fewer parameters. On ImageNet, it performs comparably to state-of-the-art networks with almost half the number of parameters, particularly when the Adaptive Fastfood transform is trained end-to-end with the rest of the network.

Implications and Future Directions

The implications of this research are significant for the deployment of CNNs on devices with constrained resources. This work primarily addresses parameter efficiency in the massively parameterized fully connected layers, paving the way for potential deployments of high-performing neural networks on smaller devices without extensive computational power.

The authors point to future work that includes optimizing GPU implementations of the Fastfood transform and extending similar parameter-reduction techniques to other layers, such as the output softmax layer. The reduction of parameters in the fully connected layers also opens the door to dynamic architectures that adjust their computational complexity to environmental constraints, leading to more adaptive and efficient real-world AI systems.

Overall, the Adaptive Fastfood transform offers a promising direction for deep learning practitioners aiming to optimize model architectures for efficiency without sacrificing performance, thereby broadening the accessibility and applicability of powerful deep learning models.