Finite Versus Infinite Neural Networks: an Empirical Study (2007.15801v2)

Published 31 Jul 2020 in cs.LG and stat.ML

Abstract: We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

Authors (7)
  1. Jaehoon Lee (62 papers)
  2. Samuel S. Schoenholz (45 papers)
  3. Jeffrey Pennington (45 papers)
  4. Ben Adlam (25 papers)
  5. Lechao Xiao (28 papers)
  6. Roman Novak (22 papers)
  7. Jascha Sohl-Dickstein (88 papers)
Citations (205)

Summary

Analysis of "Finite Versus Infinite Neural Networks: An Empirical Study"

The paper under review presents a large-scale empirical study of the correspondence between wide neural networks and kernel methods, addressing key open questions about infinitely wide neural networks. The study uncovers nuanced behavior in the performance, divergences, and alignment between these two regimes.

First, the authors compare finite and infinite neural networks, revealing that kernel methods (NNGP and NTK) often outperform finite-width fully-connected networks, yet underperform finite-width convolutional networks. Infinite networks display enhanced generalization in part because their Gaussian process framework yields reduced prediction variance. Importantly, factors like weight decay and large learning rates disrupt this correspondence by pushing finite networks away from kernel-like training dynamics.
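
At inference time, both infinite-width predictors reduce to the same closed-form kernel regression; only the kernel matrix differs (NNGP versus NTK). Below is a minimal NumPy sketch of that shared mean predictor, assuming precomputed Gram matrices:

```python
import numpy as np

def kernel_posterior_mean(K_train, K_test_train, y_train, reg=0.0):
    """Closed-form mean predictor shared by NNGP inference and NTK regression:
    f(X_test) = K(X_test, X_train) @ (K(X_train, X_train) + reg*I)^-1 @ y_train.
    Only the choice of kernel (NNGP vs. NTK) differs."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + reg * np.eye(n), y_train)
    return K_test_train @ alpha
```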

The researchers note the superiority of NNGP kernels over NT kernels in several classification tasks, challenging the prevailing assumption that the weight-space linearization underlying the NTK would necessarily yield better predictions. This finding suggests practitioners should consider the NNGP kernel first when both performance and efficiency are critical.
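
For fully-connected ReLU networks, both kernels can be computed with the standard arc-cosine recursion. The sketch below (plain NumPy, with the weight variance, bias variance, and depth treated as assumed hyperparameters) illustrates how the NNGP kernel and the NTK are built together:

```python
import numpy as np

def relu_nngp_ntk(X, depth, sigma_w2=2.0, sigma_b2=0.1):
    """NNGP and NTK Gram matrices for a depth-`depth` fully-connected ReLU
    network, via the arc-cosine recursion. X has shape (n_samples, n_features)."""
    K = sigma_w2 * (X @ X.T) / X.shape[1] + sigma_b2  # input-layer covariance
    Theta = K.copy()                                  # NTK starts equal to the NNGP kernel
    for _ in range(depth):
        d = np.sqrt(np.diag(K))
        norms = np.maximum(np.outer(d, d), 1e-12)
        cos = np.clip(K / norms, -1.0, 1.0)
        ang = np.arccos(cos)
        # E[relu(u) relu(v)] and E[relu'(u) relu'(v)] for a centered bivariate Gaussian
        K_new = sigma_w2 * norms * (np.sin(ang) + (np.pi - ang) * cos) / (2 * np.pi) + sigma_b2
        K_dot = sigma_w2 * (np.pi - ang) / (2 * np.pi)
        Theta = K_new + Theta * K_dot
        K = K_new
    return K, Theta
```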

Centering and ensembling finite networks are shown to mimic kernel-like behavior, reducing prediction variance and improving accuracy. The authors show how these techniques shift model outputs toward the mean predictor, effectively bridging the gap between finite-network outputs and their infinite-width counterparts.
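
As a rough illustration, centering subtracts each member's predictions at initialization before averaging across the ensemble. The sketch below assumes prediction arrays collected at initialization and after training:

```python
import numpy as np

def centered_ensemble_mean(preds_final, preds_init):
    """Center each ensemble member by subtracting its predictions at
    initialization, then average across members. Both arrays are assumed to
    have shape (n_members, n_test, n_classes)."""
    return (preds_final - preds_init).mean(axis=0)
```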

Moreover, practical developments emerge as the authors introduce a layer-wise scaling for L2 regularization in standard-parameterization networks. This adjustment markedly improves performance, indicating a path forward to harness, in standard settings, the beneficial regularizing effect that weight decay has under the NTK parameterization.
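
One plausible reading of such a scaling, mirroring the fact that a standard-parameterization weight matrix W corresponds to an NTK-parameterization weight omega via W = (sigma_w / sqrt(fan_in)) * omega, is a per-layer L2 coefficient proportional to fan-in. The sketch below is illustrative only; the exact scaling used in the paper may differ:

```python
import numpy as np

def layerwise_l2_penalty(weights, base_coeff, sigma_w2=2.0):
    """Illustrative layer-wise L2 penalty: each standard-parameterization weight
    matrix W (shape fan_in x fan_out) is penalized with a coefficient scaled by
    fan_in / sigma_w^2, so that the penalty matches an unscaled one on the
    NTK-parameterization weights omega, where W = (sigma_w / sqrt(fan_in)) * omega.
    The precise scaling in the paper may differ; this is a sketch of the idea."""
    return sum(base_coeff * (W.shape[0] / sigma_w2) * np.sum(W ** 2) for W in weights)
```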

Scaling up the training set also exposes a limitation of kernel methods rooted in floating-point precision: beyond a critical dataset size, precision errors in the kernel computation substantially degrade performance, emphasizing the importance of numerical stability when scaling kernel approaches.
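
A common mitigation, sketched below, is to promote the kernel solve to float64 and add a small trace-scaled diagonal term (the paper also observes that diagonal regularization acts much like early stopping); the specific regularization value here is an assumption and should be tuned:

```python
import numpy as np

def regularized_kernel_solve(K, y, eps=1e-6, dtype=np.float64):
    """Solve (K + eps * mean(diag(K)) * I) alpha = y at a chosen precision.
    Promoting the solve to float64 and adding a small trace-scaled diagonal
    term are common ways to keep large kernel systems numerically stable."""
    K = np.asarray(K, dtype=dtype)
    y = np.asarray(y, dtype=dtype)
    jitter = eps * np.mean(np.diag(K))
    return np.linalg.solve(K + jitter * np.eye(K.shape[0], dtype=dtype), y)
```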

Regularized ZCA whitening emerges as a powerful preprocessing step, yielding notable performance improvements. However, its efficacy is shown to be contingent on careful tuning of the whitening regularization, underscoring the intricacies of applying it to image data.
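
A minimal NumPy sketch of regularized ZCA whitening, with eps as the regularization strength that needs tuning (the exact form of the regularizer used in the paper may differ slightly):

```python
import numpy as np

def zca_whiten(X, eps=1e-1):
    """Regularized ZCA whitening: rotate into the PCA basis, rescale by
    1 / sqrt(eigenvalue + eps), rotate back. X has shape (n_samples, n_features);
    eps sets the regularization strength and typically needs tuning."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, U = np.linalg.eigh(cov)
    W = U @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ U.T
    return Xc @ W
```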

Finally, an impactful contribution of this paper is demonstrating how ensembling of kernel predictors makes data augmentation tractable, sidestepping the computational cost of folding all augmented examples into a single, much larger kernel. This finding could push kernels toward more practical applications, integrating advanced augmentation techniques to rival deep neural networks on vision tasks.
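
A conceptual sketch of this idea: fit an independent kernel predictor on each randomly augmented copy of the training set and average their test predictions, instead of building one kernel over the full augmented dataset. Here kernel_fn and augment are hypothetical placeholders supplied by the caller:

```python
import numpy as np

def augmented_kernel_ensemble(kernel_fn, augment, X_train, y_train, X_test,
                              n_draws=8, reg=1e-6):
    """Average kernel-regression predictions over independently augmented copies
    of the training set, rather than forming one kernel over the full augmented
    dataset. `kernel_fn(A, B)` returns the Gram matrix between rows of A and B,
    and `augment(X)` returns a randomly augmented copy of X."""
    preds = []
    for _ in range(n_draws):
        Xa = augment(X_train)
        K = kernel_fn(Xa, Xa)
        alpha = np.linalg.solve(K + reg * np.eye(K.shape[0]), y_train)
        preds.append(kernel_fn(X_test, Xa) @ alpha)
    return np.mean(preds, axis=0)
```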

In conclusion, this empirical study enriches our understanding of finite and infinite neural networks, illustrating their complex interplay and paving the way for refined practices in both settings. The insights regarding numerical stability, regularization, parameterization, and ensembling have notable implications for both theory and practice, shedding light on potential improvements and directions for future work in deep learning.