Infinite Width Neural Networks with Improved Standard Parameterization
Overview
The paper addresses a significant challenge in the study of infinite width neural networks by examining the two prevalent parameterizations: the Neural Tangent Kernel (NTK) parameterization and the naive standard parameterization. The authors critique both: the NTK parameterization discards properties of commonly trained finite networks, such as the dependence of training dynamics on relative layer widths, while the naive standard parameterization leads to a divergent neural tangent kernel in the infinite width limit. They propose an improved standard parameterization that resolves these issues while preserving the learning dynamics observed in finite width networks.
Key Contributions
The primary contribution of this work lies in the proposal of an improved standard parameterization for infinite width neural networks that preserves essential characteristics of finite width networks. This parameterization ensures that:
- Training Dynamics Consistency: Relative layer widths continue to influence training dynamics, a dependence that the NTK parameterization loses as layer width approaches infinity.
- Balanced Learning Rate Scaling: The parameterization admits stable learning rates on the same scale as those typically used for finite width networks, avoiding the divergence of the neural tangent kernel that afflicts the naive standard limit (see the layer-equation sketch after this list).
- Kernel Accuracy: The proposed kernels exhibit accuracies comparable to NTK parameterization kernels while potentially surpassing them when the width parameters are finely tuned.
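To make the contrast concrete, here is a minimal sketch of the layer equations under the two baseline parameterizations, written in the conventions common to the NTK literature; the exact form of the improved standard parameterization, including its width-scaling factor, is given in the paper and is not reproduced verbatim here.

```latex
% NTK parameterization: the width scaling is factored out of the trainable variables.
z^{l+1}_i \;=\; \frac{\sigma_w}{\sqrt{n^l}} \sum_{j=1}^{n^l} W^{l+1}_{ij}\, x^l_j \;+\; \sigma_b\, b^{l+1}_i,
\qquad W^{l+1}_{ij},\; b^{l+1}_i \sim \mathcal{N}(0,\, 1)

% Standard parameterization: the width scaling is folded into the initialization.
z^{l+1}_i \;=\; \sum_{j=1}^{n^l} W^{l+1}_{ij}\, x^l_j \;+\; b^{l+1}_i,
\qquad W^{l+1}_{ij} \sim \mathcal{N}\!\big(0,\, \sigma_w^2 / n^l\big),
\quad b^{l+1}_i \sim \mathcal{N}\!\big(0,\, \sigma_b^2\big)
```

Because the standard parameterization keeps the width dependence inside the trainable weights, naively sending the widths n^l to infinity at a fixed learning rate makes the tangent kernel grow without bound; roughly speaking, the improved parameterization instead takes the limit through a scaling factor applied to fixed base layer widths, which keeps the kernel finite while reproducing standard-parameterization dynamics.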
Experimental Insights
The paper provides empirical evidence to support the proposed parameterization's efficacy through various experiments:
- The improved standard parameterization's kernels show performance parity with NTK parameterization kernels across multiple architectures (a minimal comparison setup is sketched after this list).
- Experiments demonstrate that with optimal tuning, the improved standard parameterization can outperform NTK in kernel prediction accuracy.
- Finite width experiments indicate that networks trained with the standard and NTK parameterizations achieve similar performance under identical conditions.
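To illustrate how such a kernel comparison can be set up in practice, the following is a minimal sketch using the Neural Tangents library mentioned below. It assumes that stax.Dense accepts a parameterization keyword taking 'ntk' or 'standard'; the architecture, widths, W_std/b_std values, and toy inputs are illustrative placeholders rather than the paper's experimental configuration.

```python
# Hedged sketch: infinite-width NTK kernels under the 'ntk' and 'standard'
# parameterizations, computed with the Neural Tangents library.
from jax import random
from neural_tangents import stax


def fcn_kernel(parameterization):
    # Three-layer fully connected network. Under 'standard' the hidden width
    # (here 512) enters the infinite-width kernel; under 'ntk' it does not.
    _, _, kernel_fn = stax.serial(
        stax.Dense(512, W_std=1.5, b_std=0.05, parameterization=parameterization),
        stax.Relu(),
        stax.Dense(512, W_std=1.5, b_std=0.05, parameterization=parameterization),
        stax.Relu(),
        stax.Dense(1, W_std=1.5, b_std=0.05, parameterization=parameterization),
    )
    return kernel_fn


key = random.PRNGKey(0)
x = random.normal(key, (8, 32))  # 8 toy inputs of dimension 32

ntk_param_kernel = fcn_kernel('ntk')(x, x, 'ntk')        # (8, 8) kernel matrix
std_param_kernel = fcn_kernel('standard')(x, x, 'ntk')   # (8, 8) kernel matrix
print(ntk_param_kernel.shape, std_param_kernel.shape)
```

The two kernel matrices can then be used in identical downstream kernel-regression pipelines, which is how parity in prediction accuracy would be assessed.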
Theoretical Implications
Theoretically, this work suggests that the improved standard parameterization offers a robust pathway for carrying finite width network properties over to the infinite width limit. Because its learning dynamics align with those of standard parameterization networks as they are trained in practice, the resulting theory may describe practical network behavior more accurately.
The paper also discusses interesting observations regarding kernel contributions:
- A Bayesian neural network and a network whose readout layer alone is trained with gradient descent yield identical kernels under the NTK parameterization, but not under the standard parameterizations.
- Under the standard parameterization, the biases' contribution to the learning dynamics remains constant as width increases while the weights' contribution grows, so the biases' relative importance vanishes in wide networks (see the decomposition sketch after this list).
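A back-of-the-envelope way to see the last point (a sketch of the reasoning, not the paper's derivation): the empirical tangent kernel decomposes into per-parameter-group contributions,

```latex
\Theta(x, x') \;=\; \sum_{l} \Big( \nabla_{W^l} f(x) \cdot \nabla_{W^l} f(x')
\;+\; \nabla_{b^l} f(x) \cdot \nabla_{b^l} f(x') \Big)
```

and under the standard parameterization the weight term in each layer sums over a number of coordinates that grows with the width, whereas the bias term does not, so the biases' share of the kernel shrinks as the network gets wide.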
Practical Implications and Future Directions
Practically, the proposed parameterization makes infinite width models more directly applicable to, and interpretable in terms of, real-world finite networks, bridging the gap between theory and practice. The accompanying code released in the Neural Tangents library facilitates broader adoption and further experimentation.
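As a usage example (a minimal sketch under assumptions, not the paper's experimental code), the kernel of a standard-parameterization network can be fed into Neural Tangents' predict module to obtain closed-form infinite-width predictions; the data, widths, and hyperparameter values below are illustrative, and the sketch assumes the parameterization keyword and the gradient_descent_mse_ensemble API behave as in recent library releases.

```python
# Hedged sketch: closed-form predictions of an ensemble of infinitely wide
# networks under the standard parameterization, via Neural Tangents.
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

_, _, kernel_fn = stax.serial(
    stax.Dense(1024, W_std=1.5, b_std=0.05, parameterization='standard'),
    stax.Erf(),
    stax.Dense(1, W_std=1.5, b_std=0.05, parameterization='standard'),
)

key = random.PRNGKey(0)
k1, k2 = random.split(key)
x_train = random.normal(k1, (16, 64))
y_train = jnp.sign(x_train[:, :1])   # toy +/-1 regression targets
x_test = random.normal(k2, (4, 64))

# Mean test predictions of infinitely wide networks trained to convergence
# with gradient descent on MSE loss, using the 'ntk' kernel.
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)
y_test_mean = predict_fn(x_test=x_test, get='ntk')
print(y_test_mean.shape)  # (4, 1)
```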
Future work may explore:
- Refining tuning strategies for width parameters to systematically improve kernel performance.
- Extending this line of research to other network architectures beyond those studied in this paper.
- Investigating the implications of this parameterization on other learning paradigms such as transfer learning and lifelong learning.
The insights and methods proposed in this work constitute a significant step towards integrating infinite width neural networks into practical applications and expanding theoretical understanding of neural network behaviors across different parameterization schemes.