- The paper demonstrates that zero-error kernel machines generalize comparably to deep networks even under high label noise.
- Experiments on multiple real-world and synthetic datasets show that overfitting, even interpolating noisy labels, does not necessarily impair test performance.
- It calls for new theoretical frameworks to better explain the success of over-parameterized models in practice.
Understanding Kernel Learning to Comprehend Deep Learning
The paper "To Understand Deep Learning We Need to Understand Kernel Learning" by Mikhail Belkin, Siyuan Ma, and Soumik Mandal challenges the traditional assumptions surrounding the generalization performance of deep learning models. This work presents a comparative paper involving traditional kernel methods, classically seen as linear but in infinite-dimensional spaces, and modern deep learning techniques, notably over-parameterized neural networks like those using ReLU activations. Herein, I parse out the critical findings and implications that this paper offers to the community of machine learning researchers.
Key Points and Experimental Observations
A primary claim of the paper is that the strong performance of overfitted classifiers is not exclusive to deep learning models but extends to basic kernel machines as well. Using six real-world datasets and two synthetic ones, the authors demonstrate experimentally that kernel machines trained to zero classification error, even under high label noise, still generalize well to unseen test data. This observation contradicts conventional generalization theories, which predict that fitting the training data so completely, noise included, should considerably impair test performance.
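As a minimal, self-contained sketch of this phenomenon (not the authors' experimental setup; the synthetic task, Gaussian bandwidth, and noise rate here are arbitrary choices), the snippet below fits a ridgeless, interpolating kernel machine on data with 10% flipped labels and evaluates it on clean test points:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, bandwidth=2.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

# Synthetic binary task: the label is the sign of the first coordinate.
n, d = 500, 10
X_train = rng.normal(size=(n, d))
y_train = np.sign(X_train[:, 0])
X_test = rng.normal(size=(2000, d))
y_test = np.sign(X_test[:, 0])

# Flip 10% of the training labels to simulate label noise.
flip = rng.random(n) < 0.10
y_noisy = np.where(flip, -y_train, y_train)

# Ridgeless (interpolating) kernel regression: solve K alpha = y exactly,
# so the classifier reaches zero training error on the *noisy* labels.
K = gaussian_kernel(X_train, X_train)
alpha = np.linalg.solve(K + 1e-10 * np.eye(n), y_noisy)  # tiny jitter for stability

train_err = np.mean(np.sign(K @ alpha) != y_noisy)
test_err = np.mean(np.sign(gaussian_kernel(X_test, X_train) @ alpha) != y_test)
print(f"train error: {train_err:.3f}")  # ~0.000: the machine interpolates the noise
print(f"test error:  {test_err:.3f}")   # typically far better than chance
```

Despite memorizing every flipped label, the classifier's test error in runs like this tends to remain well below chance, which is the qualitative pattern the paper documents at scale.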
The authors further argue that traditional complexity-based generalization bounds fail to explain these empirical phenomena. Through theoretical analysis, they establish a lower bound on the RKHS norm of interpolating solutions for smooth kernels under label noise, a bound that grows nearly exponentially with the dataset size. These bounds are therefore misaligned with the empirical results, where interpolating, minimum-norm models generalize well even under significant label noise.
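Concretely, for a positive-definite kernel k with kernel matrix K on the n training points, the minimum-norm interpolant h and its RKHS norm have standard closed forms; the paper's lower bound is reproduced here only schematically, with constants A, B depending on the kernel, the input dimension d, and the noise level:

```latex
% Minimum-norm interpolant of (x_i, y_i), i = 1..n, for kernel k:
h(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i),
\qquad \alpha = K^{-1} y,
\qquad \|h\|_{\mathcal{H}}^2 = y^{\top} K^{-1} y.

% Schematic form of the paper's bound for smooth kernels when a fixed
% fraction of the labels is noise: any interpolating h then satisfies
\|h\|_{\mathcal{H}} \;\ge\; A \, e^{B \, n^{1/d}}
\quad \text{with high probability.}
```

Since norm-based capacity bounds scale with the norm of the learned function, a norm this large renders them vacuous, even as the same interpolants perform well empirically.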
In the case of ReLU neural networks and non-smooth Laplacian kernels, the authors note that these models fit random labels easily with gradient descent, indicating a large computational reach, whereas smooth Gaussian kernels require significantly more computational effort to do the same. Test performance nevertheless remains comparable between Gaussian and Laplacian kernels, in line with results for deep networks. This suggests that generalization quality is driven more by the structural properties of the kernel function than by the optimization process itself.
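A rough way to see this computational-reach gap (a toy sketch, not the paper's setup; the bandwidth heuristic and the step cap are arbitrary choices) is to run functional gradient descent on pure random labels with both kernels and count how many steps each needs to interpolate:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 10
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)           # completely random labels

dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
s = np.median(dists)                           # bandwidth: median pairwise distance
kernels = {
    "Gaussian (smooth)":      np.exp(-dists ** 2 / (2 * s ** 2)),
    "Laplacian (non-smooth)": np.exp(-dists / s),
}

for name, K in kernels.items():
    eta = 1.0 / np.linalg.eigvalsh(K).max()    # step size safe for gradient descent
    alpha = np.zeros(n)
    fitted_at = None
    for t in range(1, 20_001):
        alpha += eta * (y - K @ alpha)         # functional gradient step on squared loss
        if np.array_equal(np.sign(K @ alpha), y):
            fitted_at = t
            break
    status = (f"zero training error after {fitted_at} steps" if fitted_at
              else "still not interpolating after 20000 steps")
    print(f"{name}: {status}")
```

In runs of this kind the Laplacian kernel typically reaches zero training error in far fewer steps than the Gaussian, mirroring the eigenvalue-decay argument behind the paper's discussion of computational reach: the Gaussian kernel's rapidly decaying spectrum makes the noise directions very slow for gradient descent to fit.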
Implications and Future Directions
These results have substantial implications for understanding deep learning. By spotlighting the similarities between kernel methods and deep architectures in the overfitted regime, this research emphasizes the need to re-evaluate existing interpretations of model capacity and generalization.
One implication is the proposal of a unified perspective that approaches deep learning through the lens of kernel learning, since kernel methods are analytically tractable and their minimum-norm interpolating solutions can be characterized exactly. Consequently, further exploration is merited into the inductive biases introduced by optimization algorithms such as gradient descent and by initialization, which are postulated to mirror those seen in kernel methods; one such bias is sketched below.
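As one concrete, well-understood instance of such a bias (a minimal numerical sketch, assuming an invertible kernel matrix and enough iterations; the kernel and data are arbitrary): gradient descent on kernel least squares, initialized at zero, converges to exactly the minimum-norm interpolant rather than to some other zero-loss solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 80, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K = np.exp(-dists)                       # Laplacian kernel, unit bandwidth

# Closed-form minimum-norm interpolant: alpha* = K^{-1} y.
alpha_star = np.linalg.solve(K, y)

# Functional gradient descent from zero on the squared loss over
# f(.) = sum_i alpha_i k(., x_i): the iterates never leave the span of the
# kernel sections at the data points, so the limit is the minimum-norm solution.
alpha = np.zeros(n)
eta = 1.0 / np.linalg.eigvalsh(K).max()
for _ in range(100_000):
    alpha += eta * (y - K @ alpha)

print("max deviation from minimum-norm solution:",
      np.abs(alpha - alpha_star).max())  # should shrink toward 0 with more steps
```

This is the sense in which the optimizer, not just the hypothesis class, selects the solution, and it is one reason the paper treats kernel methods as a clean proxy for studying the implicit regularization conjectured for deep networks.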
Additionally, since complexity-based bounds yield no non-trivial guarantees for interpolating classifiers in noisy regimes, the paper calls for new theoretical frameworks that better align empirical performance with provable guarantees. Analyses similar to those used for nearest-neighbor methods could offer insight into the foundational principles of kernel methods and potentially inform our understanding of deeper networks.
In advocating that understanding kernel methods is imperative to unraveling deep learning's generalization mysteries, this paper paves the way for extensive computational and theoretical exploration. It challenges assumptions prevalent in the analysis of both deep learning and kernel machines, proposing that a careful understanding of simple, analytically tractable kernel methods may illuminate the complexities inherent in deep neural networks. The work underscores the need for fresh theoretical perspectives, potentially fostering pivotal advances in understanding and harnessing the true potential of modern machine learning models.