Neural Tangent Kernel: Convergence and Generalization in Neural Networks (1806.07572v4)

Published 20 Jun 2018 in cs.LG, cs.NE, math.PR, and stat.ML

Abstract: At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function $f_\theta$ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function $f_\theta$ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.

Authors (3)
  1. Arthur Jacot (22 papers)
  2. Franck Gabriel (20 papers)
  3. Clément Hongler (24 papers)
Citations (2,861)

Summary

  • The paper introduces NTK, demonstrating that infinite-width neural network training behaves like kernel gradient descent.
  • It proves NTK’s convergence to a deterministic kernel and its positive-definiteness with non-polynomial activations.
  • Numerical experiments confirm that finite-width networks approximate the theoretical NTK dynamics, supporting robust generalization.

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Arthur Jacot, Franck Gabriel, and Clément Hongler, in their seminal paper, introduce the concept of the Neural Tangent Kernel (NTK), offering a robust analytical tool for understanding the behavior of artificial neural networks (ANNs) during gradient descent and their generalization properties. This essay summarizes their key findings, mathematical formulations, and the implications for neural network theory and practice.

Summary of Key Contributions

The paper makes several pivotal contributions to neural network theory:

  1. NTK and Training Dynamics: The paper establishes that, in the infinite-width limit, the training dynamics of neural networks are described by kernel gradient descent with respect to the NTK. This result allows the study of ANNs in function space rather than parameter space, thereby simplifying the analysis of convergence and generalization.
  2. Convergence of NTK: The authors prove that as the width of the hidden layers in a neural network increases to infinity, the NTK converges to a deterministic kernel. This convergence implies that the NTK remains constant during training, providing a theoretical foundation for analyzing neural network training as kernel methods.
  3. Positive-Definiteness of NTK: They prove the positive-definiteness of the limiting NTK when the data is on the unit sphere, provided the activation functions are non-polynomial. This result is crucial since positive-definite kernels ensure convergence to global minima for convex loss functions.
  4. Numerical Experiments: The paper includes numerical experiments that validate the theoretical findings. These experiments demonstrate that finite-width ANNs exhibit behavior consistent with the predicted infinite-width limit, underscoring the practical relevance of the NTK approach.

Mathematical Formulations

Network Function and NTK Definition:

The network function $f_\theta$ maps input vectors to output vectors and evolves during training according to the Neural Tangent Kernel. For a network of depth $L$ with parameters $\theta = (\theta_p)_{p=1}^P$ and realization function $F^{(L)}$ (the map from parameters to the network function), the NTK is

$$\Theta^{(L)}(\theta) = \sum_{p=1}^{P} \partial_{\theta_p} F^{(L)}(\theta) \otimes \partial_{\theta_p} F^{(L)}(\theta).$$

In the infinite-width limit, the NTK converges to:

$$\Theta^{(L)}_\infty = \lim_{n_1, \ldots, n_{L-1} \to \infty} \Theta^{(L)}.$$
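
To make this definition concrete, here is a minimal NumPy sketch (not the authors' code; the one-hidden-layer ReLU architecture, the data, and the optimization settings are arbitrary illustrative choices). It computes the empirical NTK Gram matrix as the sum over parameters of products of partial derivatives, then measures how much this matrix moves over a short gradient-descent run; the relative change shrinks as the width grows, consistent with the NTK becoming constant in the infinite-width limit.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def empirical_ntk(X, W, a):
    """Empirical NTK Gram matrix of f(x) = a . relu(W x / sqrt(d)) / sqrt(n),
    i.e. the sum over all parameters of products of partial derivatives."""
    n, d = W.shape
    U = X @ W.T / np.sqrt(d)                 # pre-activations, shape (N, n)
    feats_a = relu(U) / np.sqrt(n)           # derivatives w.r.t. the output weights a
    K = feats_a @ feats_a.T                  # output-layer contribution
    B = (U > 0) * a                          # a_i * relu'(u_i), shape (N, n)
    K += (B @ B.T) * (X @ X.T) / (n * d)     # hidden-layer contribution
    return K

def train(X, y, W, a, lr=1.0, steps=200):
    """Plain full-batch gradient descent on the mean squared error."""
    n, d = W.shape
    for _ in range(steps):
        U = X @ W.T / np.sqrt(d)
        f = relu(U) @ a / np.sqrt(n)
        r = (f - y) / len(y)                 # scaled residuals
        grad_a = relu(U).T @ r / np.sqrt(n)
        grad_W = a[:, None] * (((U > 0) * r[:, None]).T @ X) / np.sqrt(n * d)
        a = a - lr * grad_a
        W = W - lr * grad_W
    return W, a

d, N = 5, 20
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = np.sin(3 * X[:, 0])                          # arbitrary regression target

for n in (50, 500, 5000):
    W0, a0 = rng.normal(size=(n, d)), rng.normal(size=n)
    K0 = empirical_ntk(X, W0, a0)
    W1, a1 = train(X, y, W0, a0)
    K1 = empirical_ntk(X, W1, a1)
    drift = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
    print(f"width {n:5d}: relative change of the empirical NTK = {drift:.4f}")
```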

Convergence and Training Dynamics:

For least-squares regression with target function $f^*$, the network function follows a linear differential equation determined by the limiting NTK, whose solution is

$$f_t = f^* + e^{-t\,\Pi}(f_0 - f^*),$$

where $\Pi$ is the positive operator associated with the limiting NTK; its eigendecomposition yields the kernel principal components of the input data, and the residual $f_t - f^*$ decays fastest along the components with the largest eigenvalues.
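
On a finite training set, $\Pi$ acts (up to normalization) as the kernel Gram matrix, so the decay of the residual along kernel principal components can be computed directly. The sketch below is a hedged toy example, not from the paper: it uses an RBF Gram matrix as a stand-in for the NTK Gram matrix and a random initial residual. Each eigen-coefficient of $f_0 - f^*$ decays as $e^{-\lambda_i t}$, so components with large eigenvalues are fit almost immediately while the small ones linger, which is the theoretical motivation for early stopping discussed below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training inputs on the unit sphere, as in the paper's positive-definiteness setting.
N, d = 40, 3
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Stand-in kernel Gram matrix; any positive-definite kernel gives the same picture.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# Kernel principal directions and their eigenvalues, sorted in decreasing order.
eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Initial residual f_0 - f^*, expressed in the eigenbasis of the kernel.
coeffs0 = eigvecs.T @ rng.normal(size=N)

# Under f_t - f^* = exp(-t K)(f_0 - f^*), coefficient i decays as exp(-t * eigvals[i]).
for t in (0.0, 1.0, 10.0, 100.0):
    coeffs_t = np.exp(-t * eigvals) * coeffs0
    top = np.linalg.norm(coeffs_t[:5])      # along the 5 largest kernel components
    rest = np.linalg.norm(coeffs_t[5:])     # along the remaining components
    print(f"t = {t:6.1f}: residual norm on top-5 components = {top:.3f}, on the rest = {rest:.3f}")
```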

Key Implications and Future Directions

Practical Implications:

  • Streamlined Analysis: The NTK transforms the complex, non-linear training dynamics of neural networks into a more tractable problem in function space, allowing researchers to leverage kernel methods (see the kernel-regression sketch after this list).
  • Generalization Understanding: By connecting ANNs to kernel methods, the NTK framework helps explain why over-parametrized networks generalize well, akin to Support Vector Machines that use kernels.
  • Early Stopping Heuristic: The paper’s findings suggest that early stopping in training aligns with focusing on the largest kernel principal components, thus providing a theoretical underpinning for this common practice.
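
The kernel-regression sketch referenced above is a hedged NumPy toy example, not from the paper: it uses the empirical NTK of a wide random one-hidden-layer ReLU network as a stand-in for the limiting kernel $\Theta^{(L)}_\infty$ and runs kernel ridge regression with it; the dataset, width, and ridge term are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

def empirical_ntk(X1, X2, W, a):
    """Empirical NTK between two batches for f(x) = a . relu(W x / sqrt(d)) / sqrt(n)."""
    n, d = W.shape
    U1, U2 = X1 @ W.T / np.sqrt(d), X2 @ W.T / np.sqrt(d)
    K = relu(U1) @ relu(U2).T / n                                   # output-layer term
    K += ((U1 > 0) * a) @ ((U2 > 0) * a).T * (X1 @ X2.T) / (n * d)  # hidden-layer term
    return K

# Toy regression problem with inputs on the unit sphere.
d, n_train, n_test, width = 4, 60, 20, 4000
X = rng.normal(size=(n_train + n_test, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1]
X_train, y_train, X_test, y_test = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

# Empirical NTK of a wide random network as a proxy for the limiting kernel.
W, a = rng.normal(size=(width, d)), rng.normal(size=width)
K_train = empirical_ntk(X_train, X_train, W, a)
K_cross = empirical_ntk(X_test, X_train, W, a)

# Kernel ridge regression with the NTK (the tiny ridge term is only for numerical stability).
alpha = np.linalg.solve(K_train + 1e-6 * np.eye(n_train), y_train)
pred = K_cross @ alpha
print("test RMSE of NTK kernel regression:", np.sqrt(np.mean((pred - y_test) ** 2)))
```

Up to a term coming from the network's random function at initialization, this kernel predictor is what kernel gradient descent on the least-squares cost converges to in the infinite-width limit, which is why the NTK governs generalization in this regime.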

Theoretical Implications:

  • Link to Gaussian Processes: Infinite-width networks are Gaussian processes at initialization, and the NTK describes how they evolve away from this initialization during training; together these results connect ANN theory with well-established Bayesian and kernel methods (a short Monte Carlo illustration of the initialization claim follows this list).
  • Kernel Methods Revival: The NTK presents kernel methods in a new light, suggesting that their principles can be fundamental in understanding deep learning.
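
The Monte Carlo illustration referenced above is an illustrative NumPy sketch, not the paper's experiments; the one-hidden-layer ReLU architecture, widths, and sample count are arbitrary. At a fixed input, the output of a randomly initialized network keeps an essentially width-independent variance while its excess kurtosis falls toward zero, consistent with convergence to a Gaussian process as the width grows.

```python
import numpy as np

rng = np.random.default_rng(2)

d, num_inits = 10, 20000
x = rng.normal(size=d)
x /= np.linalg.norm(x)            # a fixed unit-norm input

def network_output(x, n, rng):
    """One-hidden-layer ReLU network with NTK scaling: f(x) = a . relu(W x / sqrt(d)) / sqrt(n)."""
    d = x.shape[0]
    W = rng.normal(size=(n, d))
    a = rng.normal(size=n)
    return a @ np.maximum(W @ x / np.sqrt(d), 0.0) / np.sqrt(n)

for n in (5, 50, 500):
    samples = np.array([network_output(x, n, rng) for _ in range(num_inits)])
    var = samples.var()
    excess_kurtosis = ((samples - samples.mean()) ** 4).mean() / var**2 - 3.0
    print(f"width {n:4d}: output variance = {var:.3f}, excess kurtosis = {excess_kurtosis:+.3f}")
```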

Future Research Directions:

  • Broader Nonlinearities: Further research could explore NTKs with a wider range of non-polynomial activation functions to extend positive-definiteness results.
  • Finite-Width Analysis: Investigating the behavior of NTKs in finite-width networks can provide more detailed insights into practical neural network training and may lead to novel regularization techniques.
  • Complex Architectures: Extending the NTK framework to other neural network architectures, such as convolutional or recurrent networks, could provide deeper insights into the dynamics and generalization of these models.

Conclusion

In conclusion, Jacot, Gabriel, and Hongler present a comprehensive framework that connects the training dynamics of ANNs to kernel methods via the NTK in the infinite-width limit. Their formalization of NTKs, proof of positive-definiteness, and validation through numerical experiments significantly advance the theoretical understanding of neural networks while offering practical insights into training and generalization. Future research building on this framework holds the potential to further bridge the gap between deep learning and traditional kernel methods, fostering new methodologies and theoretical advancements in machine learning.
