- The paper establishes that under broad conditions, wide deep neural networks converge in distribution to Gaussian processes via a formal theorem for multi-layer architectures.
- It employs a recursive kernel construction and maximum mean discrepancy (MMD) estimates to empirically assess the convergence on multiple datasets.
- The study highlights that interpreting deep networks as Gaussian processes can inform inference strategies in Bayesian deep learning.
Gaussian Process Behavior in Wide Deep Neural Networks
The paper "Gaussian Process Behaviour in Wide Deep Neural Networks" by Matthews et al. explores the intriguing intersection of deep neural networks (DNNs) and Gaussian processes (GPs). This paper provides a formal understanding and extension of the work by Neal (1996), focusing on the behavior of wide, fully connected, feedforward networks with multiple hidden layers. As networks become increasingly wide, their distributional behavior converges to that of a Gaussian process. This work extends previous findings that established this behavior for networks with a single hidden layer.
Key Contributions
- Convergence to Gaussian Processes:
- Matthews et al. prove that, under broad conditions, randomly initialised wide deep networks converge in distribution to Gaussian processes as the widths of the hidden layers tend to infinity. The convergence is established by a formal theorem that extends to networks with more than one hidden layer.
- Theoretical Advances:
- The results rest on a recursive kernel definition that propagates covariances from layer to layer (a sketch of this recursion follows this list), complemented by empirical evaluation using maximum mean discrepancy (MMD) metrics. The paper offers a rigorous mathematical underpinning to the idea that sufficiently wide network architectures can be interpreted as GPs.
- Empirical Verification:
- The authors use MMD to empirically study the rate of convergence, comparing finite Bayesian deep networks to their GP analogues on various datasets (an estimator of this kind is sketched after the list). The results indicate close agreement between DNNs and GPs on five of the six datasets considered.
- Practical Implications:
- The paper discusses the desirability of GP behavior in networks, suggesting that if such behavior is beneficial or desired, inference using GPs should be considered as an alternative to traditional Bayesian deep learning techniques.
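To make the recursive kernel concrete, the NumPy sketch below iterates the standard layer-to-layer covariance recursion for a fully connected ReLU network, using the Cho and Saul (2009) closed form for the Gaussian expectation of a product of ReLUs. The function name, the ReLU choice, and the weight/bias variance hyperparameters are illustrative assumptions, not code or settings taken from the paper.

```python
import numpy as np

def nngp_kernel_relu(X, depth, sigma_w2=2.0, sigma_b2=0.1):
    """Limiting (infinite-width) covariance of a fully connected ReLU network.

    X: (n, d) array of inputs; depth: number of hidden layers.
    Hyperparameter names and defaults are illustrative assumptions.
    """
    n, d = X.shape
    # Base case: covariance of the first-layer pre-activations.
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d
    for _ in range(depth):
        std = np.sqrt(np.diag(K))                  # per-input standard deviations
        norm = np.outer(std, std)
        cos_theta = np.clip(K / norm, -1.0, 1.0)   # guard against rounding error
        theta = np.arccos(cos_theta)
        # Closed form for E[relu(u) relu(v)] with (u, v) jointly Gaussian
        # (Cho & Saul, 2009): sqrt(K_ii K_jj) * (sin t + (pi - t) cos t) / (2 pi).
        expectation = norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
        K = sigma_b2 + sigma_w2 * expectation
    return K
```

With `depth=0` this reduces to the base-case linear kernel; each additional hidden layer composes the recursion once more.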
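The MMD comparison can likewise be illustrated with a short, generic estimator. The sketch below implements the standard unbiased estimate of squared MMD with a squared-exponential kernel (Gretton et al., 2012); the kernel choice, lengthscale, and function names are assumptions for illustration, not the authors' experimental code. In the paper's setting, `X` would hold function values sampled from a finite-width Bayesian network prior and `Y` the corresponding draws from the limiting GP.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * lengthscale**2))

def mmd2_unbiased(X, Y, lengthscale=1.0):
    """Unbiased estimate of squared MMD between samples X (m, d) and Y (n, d)."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, lengthscale)
    Kyy = rbf_kernel(Y, Y, lengthscale)
    Kxy = rbf_kernel(X, Y, lengthscale)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2 * Kxy.mean()
```

A value near zero suggests the two sample sets are hard to distinguish under the chosen kernel, which is the sense in which the paper reports close agreement between finite networks and their GP analogues.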
Theoretical Implications
The findings bridge a critical gap in theoretical understanding by detailing how, at a fixed depth, the prior over functions induced by a network approaches GP behavior as its hidden layers widen. This convergence means that GPs provide a closed-form description of the distribution over functions of very wide networks (written out below), and it sharpens the understanding of initialisation and learning dynamics in DNNs.
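Concretely, a standard way to state the limit in this line of work is the following; the notation for the nonlinearity and the weight and bias variances is mine, not the paper's exact theorem statement.

```latex
% A depth-L fully connected network f^{(L)} with nonlinearity \phi, weight variance
% \sigma_w^2 / d_0 on the first layer, and bias variance \sigma_b^2 converges in
% distribution to a zero-mean GP whose kernel is built recursively:
f^{(L)} \xrightarrow{d} \mathcal{GP}\bigl(0, K^{(L)}\bigr),
\qquad
K^{(1)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{d_0}\, x^{\top} x',

K^{(l+1)}(x, x') = \sigma_b^2 + \sigma_w^2 \,
\mathbb{E}_{(u, v) \sim \mathcal{N}\bigl(0,\, \Sigma^{(l)}(x, x')\bigr)}
\bigl[\phi(u)\, \phi(v)\bigr],
\quad
\Sigma^{(l)}(x, x') =
\begin{pmatrix}
K^{(l)}(x, x)  & K^{(l)}(x, x') \\
K^{(l)}(x, x') & K^{(l)}(x', x')
\end{pmatrix}.
```

The ReLU code sketch earlier is the special case in which this expectation has a closed form.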
Implications for Bayesian Deep Learning
The work implies that some results in existing Bayesian deep learning literature might be closely replicated by GPs with appropriately defined kernels. The authors encourage the Bayesian deep learning community to routinely compare their model outputs to GPs, advocating for rigorous empirical research as a means of verifying assumptions and results.
Future Directions
The landscape presented in the paper prompts further exploration into whether hierarchical representations, often considered crucial in deep learning, are compromised by GP behavior. It opens avenues to investigate non-Gaussian alternatives or modifications to the architecture that could better encapsulate feature hierarchies without relying on the central limit theorem.
Conclusion
Matthews et al.'s paper provides substantial theoretical and empirical contributions to understanding how wide, deep neural networks behave like Gaussian processes. The implications are far-reaching, offering insights into when and why GP behavior might be desirable in neural networks and suggesting methodological alignment in Bayesian deep learning.
The paper successfully navigates through complex ideas, presenting a significant step towards a deeper theoretical understanding of modern neural architectures and their underlying stochastic processes.