- The paper introduces the Neural Tangent Hierarchy (NTH) to describe how the NTK evolves during training in finite-width networks.
- It develops a truncated NTH that accurately approximates the training dynamics, highlighting the differences between finite-width and infinite-width models.
- The work leverages a priori estimates and concentration properties to quantify NTK variation, providing insights into generalization and into the discrepancies between neural networks and kernel methods.
Dynamics of Deep Neural Networks and Neural Tangent Hierarchy
Abstract
The paper "Dynamics of Deep Neural Networks and Neural Tangent Hierarchy" (arXiv 1909.08156) presents a detailed analysis of gradient descent dynamics in deep neural networks through the Neural Tangent Kernel (NTK) framework. The study focuses on fully-connected neural networks of finite width and introduces the Neural Tangent Hierarchy (NTH), an infinite set of coupled ordinary differential equations (ODEs) describing the evolution of the NTK during training. The authors prove that a finite truncation of the NTH matches the training dynamics of the actual network to any prescribed accuracy, with an error controlled by the network width, thereby shedding light on the discrepancies between finite-width and infinite-width predictions.
Introduction
Deep neural networks have achieved significant breakthroughs in numerous machine learning tasks, yet our understanding of their training dynamics remains limited due to their nonlinearity and vast parameter space. This paper examines how the NTK, originally introduced to show that training becomes effectively linear in the infinite-width limit, evolves during training when the width is finite. This deviation from constancy accounts for observed performance differences between trained neural networks and kernel methods based on the fixed NTK.
Neural Tangent Dynamics
Neural Tangent Kernel at Finite Width
The NTK is defined as the inner product of the gradients of the network output with respect to the parameters, evaluated at a pair of inputs. At infinite width, the NTK remains constant during training, reducing the dynamics to linear (kernel) regression. At finite width, however, the NTK evolves, influencing the generalization ability of the network. This evolution poses an analytical challenge, which the paper addresses by deriving the NTH.
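As a concrete illustration (a minimal sketch, not code from the paper), the empirical NTK of a one-hidden-layer ReLU network $f(x) = a^\top \sigma(Wx)/\sqrt{m}$ can be computed in closed form from the parameter gradients:

```python
import numpy as np

def ntk_matrix(X, W, a):
    """Empirical NTK of f(x) = a . relu(W x) / sqrt(m).

    Entry (i, j) is the inner product of the gradients of f with
    respect to all parameters (W and a), evaluated at X[i] and X[j].
    """
    m = W.shape[0]
    pre = X @ W.T                        # (n, m) pre-activations
    act = np.maximum(pre, 0.0)           # relu(Wx)
    dact = (pre > 0).astype(float)       # relu'(Wx)
    K_a = act @ act.T / m                # contribution of grads w.r.t. a
    K_W = (X @ X.T) * ((dact * a) @ (dact * a).T / m)  # grads w.r.t. W
    return K_a + K_W

rng = np.random.default_rng(0)
n, d, m = 4, 3, 256
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
K = ntk_matrix(X, W, a)                  # (4, 4) kernel matrix
```

Because it is a Gram matrix of gradients, `K` is symmetric and positive semi-definite; at finite width $m$ it changes as `W` and `a` are updated, which is exactly the evolution the NTH tracks.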
Derivation of Neural Tangent Hierarchy
The NTH comprises an infinite hierarchy of ODEs for the NTK and its higher-order analogs. The key observation is that the time derivative of the NTK is governed by a higher-order kernel, whose derivative is in turn governed by a kernel of one order higher, and so on, yielding an infinite system of coupled equations that reflects architecture- and data-dependent effects. This approach provides a principled way to quantify dynamics that deviate from the infinite-width idealization.
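Schematically, with training data $\{(x_\alpha, y_\alpha)\}_{\alpha=1}^{n}$, network output $f_t$, and order-$r$ kernels $K_t^{(r)}$ (so $K_t^{(2)}$ is the NTK), the hierarchy takes the form, up to the paper's normalization conventions:

$$\partial_t f_t(x_\alpha) = -\frac{1}{n} \sum_{\beta=1}^{n} K_t^{(2)}(x_\alpha, x_\beta)\,\bigl(f_t(x_\beta) - y_\beta\bigr),$$

$$\partial_t K_t^{(r)}(x_{\alpha_1}, \dots, x_{\alpha_r}) = -\frac{1}{n} \sum_{\beta=1}^{n} K_t^{(r+1)}(x_{\alpha_1}, \dots, x_{\alpha_r}, x_\beta)\,\bigl(f_t(x_\beta) - y_\beta\bigr), \qquad r \ge 2.$$

Each level's evolution depends on the next, which is why a truncation rule is needed to close the system.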
Theoretical Contributions
A Priori Estimates
Under specific assumptions on the data and the activation function, the authors derive a priori estimates for the higher-order kernels in the NTH. These estimates exploit concentration properties of the random initialization, together with bounds on the weight norms along the training trajectory, to control the NTK's rate of change, which is shown to be of order $1/m$, where $m$ is the network width; this scaling aligns with empirical observations.
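To see the $1/m$ scaling numerically, here is a small experiment (my own sketch, not the paper's code) that trains one-hidden-layer ReLU networks of increasing width in the NTK parameterization and measures how far the empirical kernel moves from its initial value:

```python
import numpy as np

def ntk(X, W, a):
    """Empirical NTK of f(x) = a . relu(W x) / sqrt(m)."""
    m = W.shape[0]
    pre = X @ W.T
    act = np.maximum(pre, 0.0)
    dact = (pre > 0).astype(float)
    return act @ act.T / m + (X @ X.T) * ((dact * a) @ (dact * a).T / m)

def ntk_drift(m, steps=200, lr=0.1, seed=0):
    """Train a width-m net with gradient descent; return ||K_t - K_0||_F."""
    rng = np.random.default_rng(seed)
    n, d = 8, 3
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    K0 = ntk(X, W, a)
    for _ in range(steps):
        pre = X @ W.T
        act = np.maximum(pre, 0.0)
        r = act @ a / np.sqrt(m) - y                 # residuals f - y
        # gradients of the loss (1/2n) sum_i r_i^2
        ga = act.T @ r / (np.sqrt(m) * n)
        gW = ((pre > 0) * np.outer(r, a)).T @ X / (np.sqrt(m) * n)
        a -= lr * ga
        W -= lr * gW
    return np.linalg.norm(ntk(X, W, a) - K0)

# Kernel movement shrinks as the width m grows (roughly like 1/m).
drifts = {m: ntk_drift(m) for m in (16, 256, 4096)}
```

The drift is nonzero at every finite width but decreases sharply as the width grows, consistent with the $O(1/m)$ bound.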
Truncated NTH Approximation
By truncating the NTH at a finite level, the authors obtain a practical approximation of the NTK dynamics. The truncated hierarchy approximates the true training dynamics to arbitrary precision, with an approximation error that shrinks as the network width grows. This result is pivotal for constructing a computationally tractable framework for studying training dynamics in practical settings.
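At the lowest truncation level, the kernel is frozen at its initial value $K_0$, and the outputs $u_t$ on the training set follow the linear ODE $\dot{u} = -\tfrac{1}{n} K_0 (u - y)$, i.e. kernel regression. A toy sketch (with a synthetic positive-definite kernel, not the paper's code) integrating this ODE by forward Euler:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
K0 = A @ A.T / n + 0.1 * np.eye(n)   # synthetic positive-definite "NTK"
y = rng.standard_normal(n)           # training targets
u = np.zeros(n)                      # network outputs at initialization

# Forward-Euler integration of du/dt = -(1/n) K0 (u - y):
dt, steps = 0.05, 400
for _ in range(steps):
    u = u - dt * K0 @ (u - y) / n

residual = np.linalg.norm(u - y)     # shrinks toward 0 as t grows
```

Because $K_0$ is positive definite, every eigenmode of the residual decays exponentially under this truncation; retaining more levels of the hierarchy corrects this linear picture by the $O(1/m)$ kernel motion.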
Practical Implications
The findings have profound implications for understanding generalization in finite-width neural networks. The performance gaps between neural networks and kernel regression methods are attributed to the dynamic evolution of the NTK rather than static kernel behavior. Moreover, the insights gained here can guide architectures that exploit such dynamics to enhance learning capabilities.
Future Directions
The study suggests extensions to other architectures, such as convolutional and residual networks, and to discrete-time gradient descent. Extending the analysis in these directions would clarify the broader applicability of the NTH. In addition, exploring the implications of a dynamic NTK for generalization and for network pruning is a promising direction for subsequent research.
Conclusion
This paper advances the understanding of neural network training dynamics by establishing the Neural Tangent Hierarchy (NTH) framework, which captures NTK evolution in finite-width networks. By providing a robust theoretical foundation and practical approximation strategies, the findings pave the way for future exploration of how these dynamics influence learning in complex models.