
Deep Convolutional Networks as shallow Gaussian Processes (1808.05587v2)

Published 16 Aug 2018 in stat.ML and cs.LG

Abstract: We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the original CNN. Further, we show that this kernel has two properties that allow it to be computed efficiently; the cost of evaluating the kernel for a pair of images is similar to a single forward pass through the original CNN with only one filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84% classification error on MNIST, a new record for GPs with a comparable number of parameters.

Citations (261)

Summary

  • The paper derives a GP kernel from CNNs in the infinite-filter limit for exact Bayesian inference.
  • It demonstrates efficient kernel computation with a cost similar to a single CNN forward pass.
  • Empirical evaluation on MNIST using a 32-layer ResNet-derived kernel achieves a 0.84% classification error.

Analysis of "Deep Convolutional Networks as Shallow Gaussian Processes"

This paper presents a theoretical and empirical investigation into the connection between deep Convolutional Neural Networks (CNNs) and Gaussian Processes (GPs), establishing a bridge between these two paradigms of machine learning. The authors extend prior work on the equivalence between densely connected neural networks and GPs to CNNs and residual networks, yielding a method for treating the output of a CNN as a GP in the limit of infinitely many convolutional filters, given an appropriate prior over the weights and biases.

The core contribution is the derivation of a GP kernel from a CNN with infinitely many convolutional filters, which allows exact Bayesian inference to be performed with these networks. The resulting kernel is characterized by a small number of parameters, namely the hyperparameters of the original CNN; this is an advantage over "deep kernels" with many parameters, which carry a greater risk of overfitting.
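
The kernel computation can be pictured as propagating per-location covariances through the network, alternating a closed-form ReLU expectation with the convolution's patch averaging. The NumPy sketch below illustrates this recursion for 1-D "images"; the hyperparameter names and values (sigma_w2, sigma_b2, filter width, depth) and the initial pointwise embedding layer are illustrative assumptions, not the paper's exact architecture or implementation.

```python
import numpy as np

def relu_expectation(k_xx, k_xy, k_yy):
    # E[relu(u) relu(v)] for (u, v) ~ N(0, [[k_xx, k_xy], [k_xy, k_yy]]),
    # i.e. the arc-cosine kernel of Cho & Saul, applied elementwise.
    norm = np.sqrt(k_xx * k_yy)
    cos_t = np.clip(k_xy / np.maximum(norm, 1e-12), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)

def conv_gp_kernel(x, y, depth=3, width=3, sigma_w2=2.0, sigma_b2=0.1):
    # Per-location covariances after an initial pointwise (1x1) embedding layer.
    k_xy = sigma_b2 + sigma_w2 * x * y
    k_xx = sigma_b2 + sigma_w2 * x * x
    k_yy = sigma_b2 + sigma_w2 * y * y
    window = np.ones(width) / width  # uniform average over each patch
    for _ in range(depth):
        # ReLU: propagate covariances through the nonlinearity in closed form.
        v_xy = relu_expectation(k_xx, k_xy, k_yy)
        v_xx = relu_expectation(k_xx, k_xx, k_xx)   # = k_xx / 2
        v_yy = relu_expectation(k_yy, k_yy, k_yy)   # = k_yy / 2
        # Convolution: each output location averages covariances over its patch.
        k_xy = sigma_b2 + sigma_w2 * np.convolve(v_xy, window, mode="same")
        k_xx = sigma_b2 + sigma_w2 * np.convolve(v_xx, window, mode="same")
        k_yy = sigma_b2 + sigma_w2 * np.convolve(v_yy, window, mode="same")
    # Dense readout: average the final covariances over spatial locations.
    return sigma_b2 + sigma_w2 * relu_expectation(k_xx, k_xy, k_yy).mean()

x, y = np.random.randn(28), np.random.randn(28)
print(conv_gp_kernel(x, y))
```

Per layer, the work amounts to a few elementwise operations and one small convolution per covariance map, which is what makes evaluating the kernel for a pair of inputs comparable in cost to a forward pass through a CNN with a single filter per layer.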

Key Findings

  1. GP Kernel Derivation for CNNs: The authors demonstrate that the outputs of CNNs can be modeled as GPs in the limit of infinite convolutional filters. This involves equating CNNs with GPs by adopting a specific form for the equivalent kernel, which leverages the structure of CNNs to allow efficient computation. The equivalence is novel to this class of networks and deepens the understanding of how Bayesian approaches can be applied to modern network architectures.
  2. Efficient Kernel Computation: Notably, the computational overhead for evaluating the GP kernel is akin to a forward pass of the equivalent CNN with a single filter per layer. This results in the ability to efficiently compute the kernel, enhancing the practicality of adopting this approach in real-world applications.
  3. Empirical Performance: Applied to the MNIST classification task, the method sets a new record for GPs with a comparable number of parameters, achieving a classification error of 0.84% with a 32-layer ResNet-derived kernel. This performance underscores the potential of GP equivalents of deep CNN architectures, even without the complexity of parametric networks (a minimal inference sketch follows this list).

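To turn such a kernel into a classifier, one common route in the NNGP literature is GP regression onto one-hot label vectors, predicting the class with the largest posterior mean. The sketch below assumes precomputed kernel matrices and uses illustrative names (`K_train`, `K_test_train`, `noise`); it is a minimal example of this approach, not the authors' exact pipeline.

```python
import numpy as np

def gp_classify(K_train, K_test_train, y_train_onehot, noise=1e-6):
    # Posterior mean of GP regression onto one-hot labels; the predicted
    # class is the argmax over output dimensions.
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + noise * np.eye(n), y_train_onehot)
    mean = K_test_train @ alpha  # shape: (n_test, n_classes)
    return mean.argmax(axis=1)
```

The dominant cost is the O(n^3) solve against the training kernel matrix, which is why exact inference of this kind is practical for MNIST-scale datasets but motivates the approximations discussed under future directions.
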
Practical and Theoretical Implications

The results imply that representing CNNs as GPs can yield significant improvements in uncertainty estimation for neural networks. This could address issues related to adversarial robustness, where CNNs typically show vulnerability. Furthermore, this representation could enhance lifelong and k-shot learning methodologies by incorporating rich, uncertainty-aware predictions.

From a theoretical perspective, this work suggests new pathways for probabilistic reasoning in prevalent neural architectures. Because the kernel has few parameters, these models are more tractable for Bayesian inference, opening the door to using GPs in settings previously dominated by deterministic approaches.

Future Directions

One promising direction is the development of multilayered inducing point approximations that maintain computational efficiency. Additionally, exploring other architectures within this GP framework could streamline various machine learning tasks requiring model interpretability and robust uncertainty estimates.

Moreover, future research could extend these methods to handle larger-scale, more complex datasets, potentially leveraging cloud-based parallel computing resources to maintain computational feasibility.

In conclusion, this work lays crucial groundwork for viewing CNNs through the GP lens, offering new insights into how deep learning models can be interpreted and optimized through probabilistic frameworks while maintaining competitive performance metrics. As these ideas are further refined and tested, they may significantly influence the development of machine learning systems in safety-critical and dynamic environments.
