Infinite Width Neural Networks with Improved Standard Parameterization
Overview
The paper addresses a significant challenge in the study of infinite width neural networks by examining the two prevalent parameterizations: the Neural Tangent Kernel (NTK) parameterization and the naive standard parameterization. The authors critique both: the NTK parameterization discards properties of commonly trained finite networks, such as the dependence of training dynamics on relative layer widths, while the naive standard parameterization leads to a divergent neural tangent kernel in the infinite width limit. They propose an improved standard parameterization that resolves these issues while preserving the learning dynamics observed in finite width networks.
Key Contributions
The primary contribution of this work lies in the proposal of an improved standard parameterization for infinite width neural networks that preserves essential characteristics of finite width networks. This parameterization ensures that:
- Training Dynamics Consistency: Relative layer widths continue to influence training dynamics, a dependence that the NTK parameterization loses as layer width approaches infinity.
- Balanced Learning Rate Scaling: The parameterization admits stable learning rates on the same scale as those typically used for finite width networks, avoiding the divergence of the neural tangent kernel that afflicts the naive standard limit (see the layer-equation sketch after this list).
- Kernel Accuracy: The proposed kernels exhibit accuracies comparable to NTK parameterization kernels while potentially surpassing them when the width parameters are finely tuned.
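To make the contrast concrete, here is a minimal sketch of the layer equations under the two baseline parameterizations, written in the conventions common to the NTK literature; the exact form of the improved standard parameterization, including its width-scaling factor, is given in the paper and is not reproduced verbatim here.

```latex
% NTK parameterization: the width scaling is factored out of the trainable variables.
z^{l+1}_i \;=\; \frac{\sigma_w}{\sqrt{n^l}} \sum_{j=1}^{n^l} W^{l+1}_{ij}\, x^l_j \;+\; \sigma_b\, b^{l+1}_i,
\qquad W^{l+1}_{ij},\; b^{l+1}_i \sim \mathcal{N}(0,\, 1)

% Standard parameterization: the width scaling is folded into the initialization.
z^{l+1}_i \;=\; \sum_{j=1}^{n^l} W^{l+1}_{ij}\, x^l_j \;+\; b^{l+1}_i,
\qquad W^{l+1}_{ij} \sim \mathcal{N}\!\big(0,\, \sigma_w^2 / n^l\big),
\quad b^{l+1}_i \sim \mathcal{N}\!\big(0,\, \sigma_b^2\big)
```

Because the standard parameterization keeps the width dependence inside the trainable weights, naively sending the widths n^l to infinity at a fixed learning rate makes the tangent kernel grow without bound; roughly speaking, the improved parameterization instead takes the limit through a scaling factor applied to fixed base layer widths, which keeps the kernel finite while reproducing standard-parameterization dynamics.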
Experimental Insights
The paper provides empirical evidence to support the proposed parameterization's efficacy through various experiments:
- The improved standard parameterization's kernels show performance parity with NTK parameterization kernels across multiple architectures (a minimal comparison setup is sketched after this list).
- Experiments demonstrate that with optimal tuning, the improved standard parameterization can outperform NTK in kernel prediction accuracy.
- Finite width experiments indicate that networks trained with the standard and NTK parameterizations achieve similar performance under identical conditions.
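To illustrate how such a kernel comparison can be set up in practice, the following is a minimal sketch using the Neural Tangents library mentioned below. It assumes that stax.Dense accepts a parameterization keyword taking 'ntk' or 'standard'; the architecture, widths, W_std/b_std values, and toy inputs are illustrative placeholders rather than the paper's experimental configuration.

```python
# Hedged sketch: infinite-width NTK kernels under the 'ntk' and 'standard'
# parameterizations, computed with the Neural Tangents library.
from jax import random
from neural_tangents import stax


def fcn_kernel(parameterization):
    # Three-layer fully connected network. Under 'standard' the hidden width
    # (here 512) enters the infinite-width kernel; under 'ntk' it does not.
    _, _, kernel_fn = stax.serial(
        stax.Dense(512, W_std=1.5, b_std=0.05, parameterization=parameterization),
        stax.Relu(),
        stax.Dense(512, W_std=1.5, b_std=0.05, parameterization=parameterization),
        stax.Relu(),
        stax.Dense(1, W_std=1.5, b_std=0.05, parameterization=parameterization),
    )
    return kernel_fn


key = random.PRNGKey(0)
x = random.normal(key, (8, 32))  # 8 toy inputs of dimension 32

ntk_param_kernel = fcn_kernel('ntk')(x, x, 'ntk')        # (8, 8) kernel matrix
std_param_kernel = fcn_kernel('standard')(x, x, 'ntk')   # (8, 8) kernel matrix
print(ntk_param_kernel.shape, std_param_kernel.shape)
```

The two kernel matrices can then be used in identical downstream kernel-regression pipelines, which is how parity in prediction accuracy would be assessed.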
Theoretical Implications
Theoretically, this work suggests that the improved standard parameterization offers a robust pathway for carrying finite width network properties over to the infinite width limit. Because its learning dynamics align with those of standard parameterization networks as they are trained in practice, the resulting theory may describe practical network behavior more accurately.
The paper also discusses interesting observations regarding kernel contributions:
- A Bayesian neural network and a network whose readout layer alone is trained with gradient descent yield identical kernels under the NTK parameterization, but not under the standard parameterizations.
- Under the standard parameterization, the biases' contribution to the learning dynamics remains constant as width increases while the weights' contribution grows, so the biases' relative importance vanishes in wide networks (see the decomposition sketch after this list).
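A back-of-the-envelope way to see the last point (a sketch of the reasoning, not the paper's derivation): the empirical tangent kernel decomposes into per-parameter-group contributions,

```latex
\Theta(x, x') \;=\; \sum_{l} \Big( \nabla_{W^l} f(x) \cdot \nabla_{W^l} f(x')
\;+\; \nabla_{b^l} f(x) \cdot \nabla_{b^l} f(x') \Big)
```

and under the standard parameterization the weight term in each layer sums over a number of coordinates that grows with the width, whereas the bias term does not, so the biases' share of the kernel shrinks as the network gets wide.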
Practical Implications and Future Directions
Practically, the proposed parameterization makes infinite width models more directly applicable to, and interpretable in terms of, real-world finite networks, bridging the gap between theory and practice. The accompanying code released in the Neural Tangents library facilitates broader adoption and further experimentation.
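As a usage example (a minimal sketch under assumptions, not the paper's experimental code), the kernel of a standard-parameterization network can be fed into Neural Tangents' predict module to obtain closed-form infinite-width predictions; the data, widths, and hyperparameter values below are illustrative, and the sketch assumes the parameterization keyword and the gradient_descent_mse_ensemble API behave as in recent library releases.

```python
# Hedged sketch: closed-form predictions of an ensemble of infinitely wide
# networks under the standard parameterization, via Neural Tangents.
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

_, _, kernel_fn = stax.serial(
    stax.Dense(1024, W_std=1.5, b_std=0.05, parameterization='standard'),
    stax.Erf(),
    stax.Dense(1, W_std=1.5, b_std=0.05, parameterization='standard'),
)

key = random.PRNGKey(0)
k1, k2 = random.split(key)
x_train = random.normal(k1, (16, 64))
y_train = jnp.sign(x_train[:, :1])   # toy +/-1 regression targets
x_test = random.normal(k2, (4, 64))

# Mean test predictions of infinitely wide networks trained to convergence
# with gradient descent on MSE loss, using the 'ntk' kernel.
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)
y_test_mean = predict_fn(x_test=x_test, get='ntk')
print(y_test_mean.shape)  # (4, 1)
```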
Future work may explore:
- Refining tuning strategies for width parameters to systematically improve kernel performance.
- Extending this line of research to other network architectures beyond those studied in this paper.
- Investigating the implications of this parameterization on other learning paradigms such as transfer learning and lifelong learning.
The insights and methods proposed in this work constitute a significant step towards integrating infinite width neural networks into practical applications and expanding theoretical understanding of neural network behaviors across different parameterization schemes.