Evaluating Loss Landscape Geometry and Neural Tangent Kernel Dynamics in Deep Learning
The paper "Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel" presents a notable assembly of empirical analyses to explore how deep neural networks (DNNs) differ from their linearized counterparts—specifically, neural tangent kernel (NTK) machines. This paper investigates the interconnected aspects of the training dynamics of nonlinear deep networks, the geometry of loss landscapes, and the time variation of data-dependent NTKs, delivering insights founded on large-scale phenomenological studies.
Key Insights
The paper reports several significant findings about the dynamics of deep network training. In the first 2 to 3 epochs, a rapid, chaotic transient determines the final linearly connected basin of low loss into which the network eventually converges. During this transient the NTK changes rapidly, learning useful features from the training data and outperforming the standard NTK at initialization by roughly a factor of 3 in fewer than 3 to 4 epochs. After this early phase, the NTK evolves at an approximately constant velocity, and a kernel machine built from it matches the performance of full network training within 15% to 45% of the total training time.
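As a concrete illustration of what "tracking the NTK" involves, the sketch below computes the empirical NTK of a network on a small probe batch and a simple cosine-based distance between kernels measured at different checkpoints. It is a minimal sketch in JAX, not the authors' code; apply_fn, params, and x_probe are assumed placeholders for a pure, scalar-output model function, its parameters, and a batch of probe inputs.

```python
# Minimal sketch (not the paper's code) of measuring the empirical NTK on a
# small probe batch and quantifying how much it has changed during training.
# `apply_fn(params, x)` is an assumed pure, scalar-output model function;
# `params` and `x_probe` are placeholders for trained parameters and probe inputs.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def empirical_ntk(apply_fn, params, x_probe):
    """K_ij = <grad_params f(x_i), grad_params f(x_j)> on the probe batch."""
    def flat_grad(x):
        g = jax.grad(lambda p: apply_fn(p, x))(params)   # per-example parameter gradient
        return ravel_pytree(g)[0]                        # flatten the pytree of gradients
    J = jax.vmap(flat_grad)(x_probe)                     # Jacobian, shape (n_probe, n_params)
    return J @ J.T

def kernel_distance(k_a, k_b):
    """1 - cosine similarity between two kernels: one simple measure of NTK change."""
    return 1.0 - jnp.sum(k_a * k_b) / (jnp.linalg.norm(k_a) * jnp.linalg.norm(k_b))
```

Recomputing empirical_ntk on a fixed probe batch at regular checkpoints and plotting kernel_distance to the initial kernel is one way to visualize the rapid early change followed by the slower, constant-velocity phase; the factor-of-3 comparison in the paper refers to the performance of kernel machines built from such data-dependent kernels versus the kernel at initialization.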
The paper also contributes to the understanding of DNN generalization by highlighting transitions in the training dynamics and showing that a diverse set of metrics evolves in a highly correlated manner. It suggests a unified picture in which the interplay between the loss landscape's large-scale structure and its local geometry shapes the learning trajectory, posing new challenges for theoretical models of deep learning.
Methodological Approach
The authors measure many of these quantities simultaneously, giving a holistic view of deep learning dynamics. Notably, they employ a parent-child spawning protocol: a parent network is trained to an intermediate epoch, and multiple child networks are then spawned from the parent's weights at that epoch and trained to completion with independent SGD noise (different minibatch orders and data augmentations). Whether the resulting children end up in the same linearly connected low-loss basin reveals how early training dynamics and stochastic gradient descent (SGD) noise determine basin selection.
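The sketch below illustrates the spawning protocol and a linear-interpolation barrier check between two children. It is a schematic, not the authors' implementation: init_params, train, and test_error are hypothetical helpers standing in for an ordinary JAX training and evaluation loop.

```python
# Schematic of parent-child spawning and a linear-interpolation barrier check.
# `init_params`, `train`, and `test_error` are hypothetical helpers, not a real API.
import jax
import numpy as np

def spawn_children(spawn_epoch, n_children, total_epochs, parent_seed=0):
    """Train a parent to `spawn_epoch`, then branch children that differ only in SGD noise."""
    parent = train(init_params(jax.random.PRNGKey(parent_seed)), epochs=spawn_epoch)
    children = []
    for child_seed in range(n_children):
        # Each child starts from the parent's weights at the spawn epoch but sees a
        # different minibatch order / augmentation stream from then on.
        child = train(parent, epochs=total_epochs - spawn_epoch,
                      data_key=jax.random.PRNGKey(child_seed))
        children.append(child)
    return children

def interpolation_barrier(params_a, params_b, n_points=11):
    """Rise in test error along the straight line between two solutions.
    A near-zero barrier indicates the two children share a linearly connected basin."""
    errors = []
    for t in np.linspace(0.0, 1.0, n_points):
        mixed = jax.tree_util.tree_map(lambda a, b: (1.0 - t) * a + t * b,
                                       params_a, params_b)
        errors.append(test_error(mixed))
    return max(errors) - max(errors[0], errors[-1])
```

Spawning children at progressively later epochs and checking when their pairwise barriers vanish is how one can locate the end of the chaotic transient: beyond that point, SGD noise no longer changes which linearly connected basin the network ends up in.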
Additionally, the paper examines linearized training around an intermediate tangent plane, sometimes called Taylorized training, and contrasts it with the full nonlinear network dynamics. In effect, this asks how much training time the data-dependent NTK needs before it has learned enough features that linearized training launched from that point matches full training, compared with the NTK at random initialization. This contrast between kernel learning and full nonlinear training is one of the paper's most instructive comparisons.
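A minimal sketch of what linearized (Taylorized) training around an intermediate checkpoint looks like is given below. It assumes a pure JAX apply function; apply_fn, params_t, loss_fn, x_batch, and y_batch are assumed placeholders, and this is an illustration of the idea rather than the paper's actual setup.

```python
# Sketch of Taylorized (linearized) training around an intermediate checkpoint
# `params_t`: the model is replaced by its first-order Taylor expansion
#     f_lin(delta, x) = f(params_t, x) + J(params_t, x) @ delta,
# and only the tangent-plane displacement `delta` is optimized.
# `apply_fn`, `params_t`, `loss_fn`, `x_batch`, `y_batch` are assumed placeholders.
import jax

def make_linearized(apply_fn, params_t):
    def f_lin(delta, x):
        # jax.jvp returns f(params_t, x) and the Jacobian-vector product J @ delta
        out, tangent_out = jax.jvp(lambda p: apply_fn(p, x), (params_t,), (delta,))
        return out + tangent_out
    return f_lin

# Usage sketch: `delta` is a pytree of zeros matching params_t; gradients flow
# only through `delta`, while params_t stays frozen.
# delta = jax.tree_util.tree_map(jax.numpy.zeros_like, params_t)
# f_lin = make_linearized(apply_fn, params_t)
# grads = jax.grad(lambda d: loss_fn(f_lin(d, x_batch), y_batch))(delta)
# delta = jax.tree_util.tree_map(lambda d, g: d - lr * g, delta, grads)
```

Running such linearized training from progressively later checkpoints and comparing it against full nonlinear training is the kind of comparison behind the paper's claim that the data-dependent NTK matches full training performance within 15% to 45% of the training time.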
Theoretical and Practical Implications
The study's implications are far-reaching, for both theory and practice. By demonstrating the pivotal role of the early chaotic transient, it motivates deeper investigation of initial training dynamics that could refine theoretical accounts of DNN behavior. The findings also suggest practical refinements to training protocols, such as aligning learning-rate schedules with the transition out of the early chaotic phase, which could lead to more efficient and effective training regimens in practice.
Moreover, the paper raises questions about the limits of existing kernel-based theories, such as NTK theory, which might not fully capture the richness of training dynamics at finite network widths and non-infinitesimal learning rates.
Future Directions
The paper opens numerous avenues for future deep learning research. There is ample room to develop a comprehensive theoretical framework that coherently integrates these empirical observations, from loss landscape geometry to NTK dynamics. A sharper understanding of the chaotic-to-stable transition, and of the kernel learning that occurs during it, could inform strategies for network design and training that exploit the early chaotic phase for better accuracy and robustness.
In conclusion, by examining the intertwined relationship between loss landscape geometry and learning dynamics, this paper provides foundational insights into the phenomenology of deep learning. It points toward more integrative models that combine empirical observation with refined theoretical analysis, advancing our understanding of how deep networks learn.