Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics (2105.03703v1)

Published 8 May 2021 in cs.LG, cs.NE, and math.PR

Abstract: Yang (2020a) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show the same neural networks (in the so-called NTK parametrization) during training follow a kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the architectural universality of NTK behavior. To achieve this result, we apply the Tensor Programs technique: Write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem. To facilitate this proof, we develop a graphical notation for Tensor Programs.
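As a brief illustration (using generic NTK-limit notation, not the paper's own), the kernel gradient descent dynamics in function space referenced in the abstract take, under squared loss on training data $(x_i, y_i)_{i=1}^n$, the form

$$\frac{d}{dt} f_t(x) = -\eta \sum_{i=1}^{n} K_\infty(x, x_i)\,\big(f_t(x_i) - y_i\big),$$

where $f_t$ is the network function at training time $t$, $\eta$ is the learning rate, and $K_\infty$ is the infinite-width NTK, which stays fixed at its value at initialization in the infinite-width limit. The paper's contribution is establishing that this limiting dynamics holds for the same broad class of architectures (e.g. ResNet and Transformers) covered by Yang (2020a), not only simple feedforward networks.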

Authors (2)
  1. Greg Yang (35 papers)
  2. Etai Littwin (25 papers)
Citations (58)
