
When Do Neural Networks Outperform Kernel Methods? (2006.13409v2)

Published 24 Jun 2020 in stat.ML, cs.LG, math.ST, and stat.TH

Abstract: For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothness classes than RKHS and we know of special examples for which SGD-trained NN provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model that can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test numerically this hypothesis by showing that specific perturbations of the training distribution degrade the performances of RKHS methods much more significantly than NNs.

Citations (174)

Summary

  • The paper demonstrates that neural networks excel over kernel methods when data contain low-dimensional signals amidst high-dimensional noise.
  • The paper introduces the spiked covariates model to quantify effective dimensions and sample complexity, highlighting RKHS methods' limitations with polynomial approximations.
  • The paper provides theoretical and empirical analysis, including NTK and random features perspectives, to explain NNs' robustness in real-world classification tasks.

When Do Neural Networks Outperform Kernel Methods?

The paper examines when neural networks (NNs) hold a performance advantage over kernel methods, particularly reproducing kernel Hilbert space (RKHS) methods, in supervised learning. The debate over their respective capabilities centers on practical performance across classification tasks: RKHS methods can approximate wide NNs in the "lazy training" regime, where NNs behave almost linearly and act like certain kernel methods, but the distinction between the two becomes crucial for tasks with inherent low-dimensional structure.

The core hypothesis is that NNs gain an advantage over RKHS methods when the input data possess a latent low-dimensional structure that RKHS methods cannot capture efficiently. NNs can exploit such structure, avoiding the "curse of dimensionality" that plagues traditional kernel methods in high-dimensional spaces lacking it. The authors encapsulate this in the spiked covariates model, a framework combining a low-dimensional signal representation with noisy high-dimensional covariates; a minimal data-generation sketch follows.
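
The sketch below samples from a spiked covariates model of the kind described above. It is an illustrative assumption, not the authors' code: the Gaussian covariates, the quadratic target depending only on the signal block, and the sizes `d_signal`, `d_noise`, and `snr` are hypothetical choices made for clarity.

```python
# Minimal sketch of a spiked covariates model: a low-dimensional signal block
# with larger variance, concatenated with high-dimensional noise covariates.
import numpy as np

rng = np.random.default_rng(0)

n, d_signal, d_noise = 1000, 10, 490   # hypothetical sample and block sizes
snr = 5.0                              # signal covariates carry larger variance

x_signal = snr * rng.standard_normal((n, d_signal))
x_noise = rng.standard_normal((n, d_noise))
X = np.concatenate([x_signal, x_noise], axis=1)   # covariates, shape (n, 500)

# The target depends only on the signal block, e.g. a degree-2 polynomial.
y = x_signal[:, 0] * x_signal[:, 1] + x_signal[:, 2] ** 2

# Rough sample-complexity contrast for fitting a degree-2 polynomial:
# a method that treats all covariates symmetrically pays roughly
# (d_signal + d_noise)**2 samples, versus d_signal**2 if it isolates the signal.
print("all covariates:", (d_signal + d_noise) ** 2)   # 250000
print("signal block only:", d_signal ** 2)            # 100
```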

The central observations and results from the paper are as follows:

  • The Spiked Covariates Model: The authors propose a model where the data $\mathbf{x}$ are expressed as high-dimensional vectors partitioned into low-dimensional signal covariates and higher-dimensional noise covariates. This model helps differentiate the scenarios where NNs excel over kernel methods by framing learning as dependent on the signal-to-noise ratio across the covariates.
  • Scaling Dimensions for Efficient Learning: Based on theoretical analysis and empirical validation, the paper quantifies the sample complexity required by kernel methods in terms of an effective dimension $d_{\text{eff}}$. For RKHS methods to learn polynomials of degree $\ell$, the number of samples $n$ must scale as $d_{\text{eff}}^{\ell}$. The effective dimension reduces to the size of the signal subspace when the noise covariates do not overwhelm the underlying structure.
  • Random Features and Neural Tangent Kernel (NTK) Theories: The authors provide an in-depth analysis of the approximation limits of random features (RF) and neural tangent (NT) models, two key linearizations of NNs in the infinite-width regime. Even as these models attempt to capture the behavior of NNs, a persistent approximation error remains unless the input structure is suitably low-dimensional (a minimal RF regression sketch appears after this list).
  • Empirical Tests on Real-world Datasets: In experiments that add noise to the high-frequency Fourier components of images, the authors demonstrate that RKHS performance is fragile to perturbations of the underlying covariate distribution, while neural networks adapt far better. This showcases NNs' capability to latch onto the critical structural aspects of the data (a sketch of this perturbation follows the RF example below).
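
The following is a minimal sketch of a random features (RF) ridge regression of the kind the third bullet refers to: only the second layer is trained on top of frozen random first-layer weights. The ReLU activation, the width, and the ridge strength are assumptions for illustration, not the paper's exact setup.

```python
# Minimal random features (RF) ridge regression: frozen random first-layer
# weights, ReLU features, and a trained linear second layer.
import numpy as np

def rf_ridge_predict(X_train, y_train, X_test, width=2000, lam=1e-3, seed=1):
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    W = rng.standard_normal((d, width)) / np.sqrt(d)   # frozen first-layer weights
    relu = lambda z: np.maximum(z, 0.0)
    phi_train = relu(X_train @ W)                      # fixed random features
    phi_test = relu(X_test @ W)
    # Ridge regression on the features: the only trainable part of the RF model.
    A = phi_train.T @ phi_train + lam * np.eye(width)
    coef = np.linalg.solve(A, phi_train.T @ y_train)
    return phi_test @ coef
```

Run on data drawn from the spiked covariates sketch above, such a fixed-feature model treats signal and noise covariates symmetrically, which, loosely speaking, is why it pays the effective-dimension penalty discussed in the second bullet; a fully trained two-layer NN can instead adapt its first-layer weights to the signal subspace.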

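As a companion to the empirical bullet above, here is a sketch of the kind of perturbation described there: injecting noise into the high-frequency Fourier components of an image. The cutoff radius and noise level are illustrative assumptions, not the paper's exact settings, and the function is not the authors' code.

```python
# Add complex Gaussian noise to Fourier modes outside a low-frequency disc,
# perturbing only the high-frequency content of a grayscale image.
import numpy as np

def add_high_freq_noise(img, cutoff=8, noise_std=0.5, seed=0):
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fftshift(np.fft.fft2(img))       # centered 2-D spectrum
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    high_freq = radius > cutoff                        # mask of high-frequency modes
    noise = noise_std * (rng.standard_normal(spectrum.shape)
                         + 1j * rng.standard_normal(spectrum.shape))
    noisy_spectrum = spectrum + noise * high_freq
    return np.real(np.fft.ifft2(np.fft.ifftshift(noisy_spectrum)))
```
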
The research underscores the ability of NNs to discover and exploit low-dimensional structure within data, identifying scenarios in which they decisively outperform kernel approaches. The conclusions extend to real-world applications such as image classification, where NNs can compensate for structural complexity and intrinsic noise. For the design of machine learning systems, the findings indicate that NNs' adaptability to latent low-dimensional structure in the data cannot be adequately reproduced by kernel methods.

In closing, while the paper clarifies the scenarios in which NNs are superior, it also emphasizes the central role of the underlying data structure in informing model choice, suggesting that further studies could refine these insights for broader application domains and potentially lead to hybrid approaches that combine the advantages of each.