- The paper argues that infinite neural networks, despite their theoretical tractability, are stuck with fixed kernels, unlike flexible finite networks trained with SGD, and this hinders representation learning and transfer learning.
- Empirical and analytical results on finite deep linear networks show that they learn kernels much as SGD-trained architectures such as ResNets do, but that performance degrades in very wide finite networks because flexibility for representation learning diminishes.
- The paper introduces infinite networks with bottlenecks as a theoretical concept that retains analytical advantages while enabling representation learning, offering a potential middle ground for future architecture design.
Analysis of Finite and Infinite Neural Networks: A Theoretical Perspective
The paper by Laurence Aitchison offers a nuanced examination of the theoretical underpinnings of neural networks, contrasting finite and infinite models. Common wisdom in machine learning holds that as neural networks grow larger, their performance generally improves. This paper probes the limits of that assumption for infinite networks, explaining why it may not hold in all contexts, particularly with respect to flexibility and representation learning.
Key Findings
Infinite-width Bayesian Neural Networks (BNNs) are known to converge to Gaussian Processes (GPs) as the number of channels grows to infinity, which makes exact Bayesian inference tractable. Despite this, the paper argues that infinite networks are saddled with a fixed kernel that cannot be modified or learned, leading to inferior performance compared to finite networks trained with Stochastic Gradient Descent (SGD). Because the kernel never adapts to data, infinite networks cannot perform representation learning or transfer learning, tasks that are often pivotal in real-world applications.
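A minimal NumPy sketch (illustrative only, not code from the paper) makes the static-kernel point concrete: the standard infinite-width (NNGP) kernel of a fully connected ReLU network is computed by a deterministic recursion over the inputs alone, so nothing in it can respond to the training targets. Function names and hyperparameter values below are assumptions for illustration.

```python
import numpy as np

def nngp_relu_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Infinite-width (NNGP) kernel of a fully connected ReLU network.

    The recursion depends only on the inputs X; no labels enter,
    so the kernel is fixed and cannot adapt to the task.
    """
    # Layer-0 kernel: scaled input Gram matrix.
    K = sigma_b2 + sigma_w2 * (X @ X.T) / X.shape[1]
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag) + 1e-12
        theta = np.arccos(np.clip(K / norm, -1.0, 1.0))
        # Arc-cosine (degree-1) expectation for ReLU nonlinearities.
        K = sigma_b2 + (sigma_w2 / (2 * np.pi)) * norm * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        )
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))    # 5 inputs with 10 features each
K = nngp_relu_kernel(X, depth=3)    # same kernel no matter what the targets are
```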
To illustrate these points, the paper presents both empirical and analytical work on finite deep linear networks. It finds that these networks learn kernels in a manner more akin to SGD-trained architectures such as ResNets than to the static-kernel regime of infinite models. Numerical experiments reinforce these observations, showing that as finite networks grow wider they eventually suffer degraded performance because their flexibility for representation learning diminishes.
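To see how that flexibility disappears with width, the following sketch (an assumption-laden illustration, not the authors' code) draws finite deep linear networks from an i.i.d. Gaussian prior and measures how much the output-layer Gram matrix fluctuates across draws; those fluctuations are the network's room to adapt its representation, and they shrink as width grows.

```python
import numpy as np

def sample_linear_kernel(X, width, depth, rng):
    """One prior draw of the output-layer Gram matrix of a finite
    deep linear network with i.i.d. Gaussian weights (variance 1/fan_in)."""
    F, fan_in = X, X.shape[1]
    for _ in range(depth):
        W = rng.standard_normal((fan_in, width)) / np.sqrt(fan_in)
        F, fan_in = F @ W, width
    return F @ F.T / width

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 20))
for width in (8, 64, 512):
    Ks = np.stack([sample_linear_kernel(X, width, depth=4, rng=rng)
                   for _ in range(200)])
    # Fluctuations of the kernel across prior draws shrink with width,
    # i.e. the prior over representations becomes increasingly rigid.
    print(width, Ks.std(axis=0).mean())
```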
Theoretical Contributions
The paper introduces a significant theoretical construct: infinite networks with bottlenecks, which are analytically tractable yet capable of representation learning. These networks promise a middle ground, retaining the theoretical advantages of infinite networks while incorporating the flexibility of finite ones.
The paper examines the dynamics of kernels in these configurations from both a prior and a posterior perspective. It demonstrates that, in linear setups, deeper and narrower networks allow more kernel variability, a beneficial trait for adapting to data-specific representations. Furthermore, experiments with ResNets suggest that real-world networks align more closely with finite-network behavior than with their infinite counterparts.
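A rough sketch of the reasoning in the linear case (the notation below is assumed here, following the standard deep linear setup rather than taken verbatim from the paper): each layer preserves the previous layer's Gram matrix in expectation but adds fluctuations whose size is set by that layer's width.

```latex
% Deep linear network: F_\ell = F_{\ell-1} W_\ell, with i.i.d. weights
% (W_\ell)_{ij} \sim \mathcal{N}(0, 1/N_{\ell-1}) and Gram matrices G_\ell.
G_\ell = \tfrac{1}{N_\ell} F_\ell F_\ell^{\top}, \qquad
\mathbb{E}\left[ G_\ell \mid G_{\ell-1} \right] = G_{\ell-1}, \qquad
\operatorname{Var}\left[ (G_\ell)_{ij} \mid G_{\ell-1} \right] = O\!\left( \tfrac{1}{N_\ell} \right)
```

Per-layer fluctuations of order one over the width accumulate over depth, so the prior kernel varies on a scale of roughly depth over width: making the network deeper or narrower loosens the kernel prior, while taking every width to infinity at fixed depth freezes it entirely.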
Practical and Theoretical Implications
Practically, these insights counsel caution when considering infinite architectures for tasks that require adaptive learning. In domains where transfer learning is needed, the inability to learn representationally rich kernels could be a significant limitation. Combining infinite-width layers with finite bottlenecks offers a potential way around this limitation and a promising direction for future research in neural architecture design.
Theoretically, the findings challenge established beliefs about the utility of infinite networks and motivate new work on how these models might be modified to better match the successful dynamics of finite ones. The paper also invites researchers to explore more deeply how network architecture shapes kernel learning, especially in light of the stochasticity introduced by sampling methods such as Langevin dynamics.
Conclusion and Future Directions
Aitchison's work sheds light on the complex interplay between network size and functional adaptability, urging the field to rethink how infinite-network assumptions map onto practical learning capabilities. By highlighting the importance of architectural configuration, this research paves the way for novel network architectures that reconcile theoretical tractability with empirical performance.
Future research could build on these concepts by exploring how bottleneck configurations can be optimized for various learning tasks, and whether the theoretical insights afforded by this framework can translate into tangible improvements in application domains like computer vision, natural language processing, and beyond. Additionally, investigating other forms of regularization and stochastic techniques in the context of network flexibility might provide further avenues for enhancing the adaptability of both finite and infinite neural networks.