- The paper demonstrates that the performance gap in infinite-width models is driven by suboptimal optimizers rather than limited feature learning.
- It introduces ADAM*, an infinite-width-compatible variant of ADAM, and derives an NTK formulation that captures ADAM-like training dynamics in the infinite-width limit.
- Empirical tests with a six-layer, decoder-only transformer confirm that ADAM*-driven NTKs closely track the performance of finite models with frozen features.
Infinite Width Models That Work: Why Feature Learning Doesn’t Matter as Much as You Think
Introduction
This paper scrutinizes the performance gap between finite-width networks and infinite-width Neural Tangent Kernel (NTK) models. Contrary to prevailing assumptions, it establishes that the subpar performance of infinite-width NTK models is not due to an inherent lack of feature learning. Rather, the gap arises from the rudimentary optimizers, such as stochastic gradient descent (SGD), used in the standard constructions. By introducing an infinite-width limit based on ADAM-like learning dynamics, the authors empirically demonstrate that this modification effectively bridges the gap.
Feature Learning in Infinite Width Models
Infinite-width models have long been criticized for their inability to learn feature representations dynamically. The introduction of NTKs attempted to model training dynamics in the infinite-width limit, but these models have underperformed finite-width counterparts. This has been generally attributed to the freezing of feature layers during training, as established by the "Dynamical Dichotomy" theorem, which asserts that infinite-width models admit a kernel representation only if features remain static.
Surprisingly, this paper shows that infinite-width NTKs have access to a richer feature set than finite models. Given an infinite-dimensional feature vector, the final layer can effectively perform any conceivable operation by appropriately weighting and selecting subfeatures. This insight contradicts the notion that static features are intrinsically limiting.
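To make the intuition concrete, consider a fixed bank of random features with only a linear readout trained on top: with enough subfeatures, the readout alone can fit a nonlinear target even though the features never move. The snippet below is a minimal sketch of this idea using random ReLU features and ridge regression; the feature map, width, and target function are illustrative choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed (never trained) random ReLU feature map: x -> relu(w * x + b).
n_features = 2048
W = rng.normal(size=(n_features, 1))
b = rng.normal(size=(n_features,))

def features(x):
    # x: (n_samples, 1) -> (n_samples, n_features); the features stay frozen.
    return np.maximum(W @ x.T + b[:, None], 0.0).T / np.sqrt(n_features)

# A nonlinear target the frozen features were never tuned for.
x_train = np.linspace(-3, 3, 200)[:, None]
y_train = np.sin(2 * x_train[:, 0]) + 0.1 * rng.normal(size=200)

# Only the linear readout is fit (ridge regression on the frozen features).
Phi = features(x_train)
lam = 1e-3
readout = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y_train)

x_test = np.linspace(-3, 3, 50)[:, None]
pred = features(x_test) @ readout
print("test MSE:", np.mean((pred - np.sin(2 * x_test[:, 0])) ** 2))
```

With a sufficiently rich set of static subfeatures, weighting and selecting them at the readout recovers much of what a finite network would otherwise obtain by moving its features.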
Optimizer Improvements
The paper identifies the true constraint on infinite-width NTK models: the use of less sophisticated optimizers. While most kernel methods and the NTK framework are traditionally derived using SGD, modern deep-learning architectures achieve higher performance levels with advanced optimizers like ADAM. To address this, the paper formulates an NTK model that captures ADAM-like learning dynamics.
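For reference, the contrast between the two optimizer families comes down to the per-parameter update rule. The sketch below shows a plain SGD step next to a standard ADAM step for a single parameter vector; the hyperparameter values are the usual defaults, not values taken from the paper.

```python
import numpy as np

def sgd_step(theta, grad, lr=1e-2):
    # Plain SGD: the same scale for every coordinate.
    return theta - lr * grad

def adam_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard ADAM: momentum (m) plus a per-coordinate adaptive scale from v.
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2          # per-coordinate second moment
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

theta = np.zeros(4)
state = (np.zeros(4), np.zeros(4), 0)
grad = np.array([0.1, -2.0, 0.01, 5.0])
theta, state = adam_step(theta, grad, state)
print(theta)  # each coordinate moves by roughly the same magnitude, unlike SGD
```

The per-coordinate second-moment estimate v is precisely the piece that does not carry over cleanly to an infinite feature set, which motivates the ADAM* construction described next.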
ADAM* - An Infinite-Width-Compatible Optimizer
To derive an infinite-width NTK model with ADAM properties, the paper introduces ADAM*, a variant of ADAM tailored to infinite limits. ADAM* modifies the conventional ADAM algorithm by replacing the per-parameter second-moment (variance) term v_t with its expectation over the infinite feature set. This modification ensures that the optimizer remains well defined even in the infinite-width regime.
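The paper's exact definition is not reproduced here; the sketch below only illustrates the stated modification, replacing the per-coordinate second moment v_t with a single expectation taken over the feature dimension, so that the adaptive scale becomes a scalar that survives the infinite-width limit. Treat it as a hedged reading of the description above, not the paper's algorithm.

```python
import numpy as np

def adam_star_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Illustrative ADAM*-style step: v is the *expected* squared gradient
    # across coordinates (a scalar), rather than a per-coordinate vector.
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * np.mean(grad**2)   # expectation over the feature set
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

theta = np.zeros(4)
state = (np.zeros(4), 0.0, 0)                   # v is now a single scalar
grad = np.array([0.1, -2.0, 0.01, 5.0])
theta, state = adam_star_step(theta, grad, state)
print(theta)  # momentum is kept; the adaptive scale is shared across coordinates
```

Because this v_t depends only on an average over coordinates, it has a well-defined value as the number of features goes to infinity, which is what makes a kernel description of the dynamics possible.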
Theoretical Analysis
The theoretical contributions show that infinite-width models equipped with ADAM* dynamics can close the performance gap with their finite-width counterparts:
- The paper constructs a revised NTK formulation incorporating ADAM*.
- The kernel computation remains efficient, at O(t) cost, even in the infinite-width limit (see the sketch after this list).
- The analysis shows that this NTK retains the momentum and adaptive learning-rate behavior that make ADAM effective.
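The O(t) cost can be read as follows: in discrete-time NTK dynamics, the prediction on a test point accumulates one kernel-weighted correction per training step, so evaluating the model after t steps takes on the order of t kernel applications. The sketch below shows this accumulation for plain gradient-descent NTK dynamics with squared loss; the ADAM*-driven recursion changes the per-step update (momentum and a shared adaptive scale) but the per-step bookkeeping is analogous. This is a generic illustration, not the paper's exact recursion.

```python
import numpy as np

def ntk_gd_predictions(K_train, K_test_train, y, lr=0.05, steps=200):
    # Discrete-time NTK dynamics under gradient descent on squared loss:
    #   f_{s+1}(x) = f_s(x) - lr * K(x, X) @ (f_s(X) - y)
    # One kernel application per step, so the cost grows linearly in `steps`.
    f_train = np.zeros(K_train.shape[0])          # predictions on training data
    f_test = np.zeros(K_test_train.shape[0])      # predictions on test points
    for _ in range(steps):
        residual = f_train - y
        f_train = f_train - lr * K_train @ residual
        f_test = f_test - lr * K_test_train @ residual
    return f_test

# Toy usage with an RBF kernel standing in for the (infinite-width) NTK.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2)); y = np.sin(X[:, 0])
X_test = rng.normal(size=(5, 2))
rbf = lambda A, B: np.exp(-np.sum((A[:, None] - B[None]) ** 2, axis=-1))
print(ntk_gd_predictions(rbf(X, X), rbf(X_test, X), y))
```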
Empirical Validation
The model's empirical performance is validated using a six-layer, decoder-only transformer with an embedding dimension of 512, trained on the C4 dataset. Four configurations were compared (summarized in the sketch after this list):
- Finite MLP with no feature learning.
- Finite NTK using standard construction.
- Infinite NTK using the new ADAM* construction.
- Original, unfrozen, finite MLP.
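For concreteness, the reported setup corresponds roughly to the following configuration record. Only the values already stated above (six decoder-only transformer layers, embedding dimension 512, the C4 dataset, and the four variants) come from the source; the field names are a readability aid, not the paper's terminology.

```python
# Hypothetical configuration record summarizing the reported experiments.
experiment = {
    "architecture": "decoder-only transformer",
    "num_layers": 6,
    "embedding_dim": 512,
    "dataset": "C4",
    "variants": [
        "finite MLP, no feature learning (frozen features)",
        "finite NTK, standard construction",
        "infinite NTK, ADAM* construction",
        "original finite MLP, unfrozen features",
    ],
}
```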
The experimental results confirm the hypotheses. Traditional (SGD-based) NTKs underperform even the frozen-feature MLP, which decouples the performance gap from feature learning capabilities. The ADAM* NTK models successfully close this gap, closely tracking the performance of the frozen-feature models.
Implications and Future Developments
This research has significant implications for the development of more efficient machine learning models, particularly in the context of infinite-width neural networks. By resolving the performance discrepancy through optimizer improvements, this paper paves the way for more competitive and expressive infinite-width models. Future work could further investigate the scalability and generalization performance of these models across more diverse datasets and architectures.
Overall, by illuminating the nuances of infinite-width learning and demonstrating the efficacy of ADAM*-like dynamics, this paper sets a new direction for optimizing and understanding infinite-width neural networks. The integration of advanced optimizers opens new avenues for leveraging the theoretical power of infinite models in practical applications.