Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-Off
The paper by Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal addresses the apparent conflict between a foundational theoretical construct in machine learning, the bias-variance trade-off, and the empirical success of contemporary machine learning practice.
The Classical Understanding vs. Modern Observations
Traditionally, the bias-variance trade-off has been a foundational concept in machine learning, guiding practitioners to balance model complexity (capacity) against the risk of overfitting. In classical terms, a model that is too simple (high bias) fails to capture the underlying structure of the data, whereas an overly complex model (high variance) fits noise in the training data and generalizes poorly.
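For concreteness, the classical picture rests on the standard squared-error decomposition of expected test risk (textbook notation, not notation taken from the paper):

```latex
% Targets are y = f(x) + \varepsilon with noise variance \sigma^2; \hat{f} is the
% learned predictor, and the expectation is over the random training sample.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Increasing capacity typically shrinks the bias term while inflating the variance term, which is what produces the classical U-shaped test risk curve.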
Surprisingly, contemporary practice with highly parameterized models, particularly neural networks, routinely achieves near-zero (often exactly zero) training error without exhibiting the classical signs of overfitting. These models often generalize well despite their capacity to fit the training data exactly, contradicting the expectations set by the classical U-shaped risk curve.
Introducing the Double Descent Risk Curve
The central contribution of the paper is the "double descent" risk curve, which extends the classical U-shaped bias-variance trade-off curve. The curve shows that once model capacity passes the interpolation threshold, the point at which the model can just fit the training data exactly, further increases in capacity lead to a renewed reduction in test risk.
- Classical Regime: In the under-parameterized regime (left of the interpolation threshold), increasing model capacity decreases training risk monotonically, while test risk first falls and then rises past a "sweet spot", tracing the familiar U shape.
- Modern Interpolating Regime: Beyond the interpolation threshold (right side of the curve), where models achieve zero training error, further increases in capacity yield a second, often sustained, decline in test risk. This regime encompasses much of modern machine learning practice.
The paper substantiates the double descent shape empirically across a range of model classes, including random Fourier features, fully connected neural networks, random forests, and boosted tree ensembles, suggesting that the phenomenon is widespread rather than an artifact of a single model family.
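A minimal way to see both regimes in one experiment is to sweep the number of random Fourier features past the interpolation threshold and fit each model with the minimum-norm least-squares solution, in the spirit of the paper's RFF experiments. The synthetic data, feature counts, and bandwidth below are illustrative choices rather than the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem (illustrative, not one of the paper's datasets).
n_train, n_test = 40, 500
x_train = rng.uniform(-1.0, 1.0, size=(n_train, 1))
x_test = np.linspace(-1.0, 1.0, n_test).reshape(-1, 1)
target = lambda x: np.sin(4.0 * x).ravel()
y_train = target(x_train) + 0.3 * rng.standard_normal(n_train)
y_test = target(x_test)

def rff(x, w, b):
    """Random Fourier feature map cos(x w^T + b)."""
    return np.cos(x @ w.T + b)

for n_features in [5, 10, 20, 40, 80, 160, 640, 2560]:
    # Random frequencies and phases define the feature map.
    w = 5.0 * rng.standard_normal((n_features, 1))
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    phi_train, phi_test = rff(x_train, w, b), rff(x_test, w, b)

    # lstsq returns the minimum-norm solution; once n_features exceeds n_train
    # this interpolates the training data while keeping the coefficient norm small.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)

    train_mse = np.mean((phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} features | train MSE {train_mse:.3f} | test MSE {test_mse:.3f}")
```

With settings like these, test error typically rises sharply as the feature count approaches the number of training points (the interpolation threshold) and then falls again as the model is made much wider, tracing the double descent shape; the exact numbers depend on the random seed and problem parameters.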
Mechanisms Behind Double Descent
The mechanisms proposed for the emergence of double descent are grounded in the concept of inductive biases:
- In models like Random Fourier Features (RFF) and neural networks, the second descent is tied to a preference for small-norm, smooth interpolating solutions: the RFF experiments select the minimum-norm interpolant explicitly, while for neural networks a similar bias is imposed implicitly by training procedures such as SGD, even in highly over-parameterized settings (a minimal numerical sketch appears below).
- For ensemble methods like Random Forests and boosting, averaging the predictions of many individually flexible components smooths out their idiosyncratic errors, leading to smoother and more robust predictions.
In both cases the result is a model that, despite its nominal complexity, selects functions well aligned with the underlying data distribution, which is what drives the improved generalization.
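The implicit-regularization claim for gradient-based training can be illustrated on the simplest over-parameterized problem, an under-determined linear least-squares fit: gradient descent started from zero converges to the minimum-norm interpolating solution, the same one the pseudo-inverse returns. This is a standard linear-algebra illustration used here as a stand-in for the paper's argument, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Under-determined regression: more parameters than samples, so infinitely
# many weight vectors fit the training data exactly.
n_samples, n_params = 20, 200
X = rng.standard_normal((n_samples, n_params))
y = rng.standard_normal(n_samples)

# Plain gradient descent on squared error, starting from w = 0.
w = np.zeros(n_params)
lr = 1e-3
for _ in range(5_000):
    w -= lr * X.T @ (X @ w - y)

# Minimum-norm interpolant computed directly via the pseudo-inverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual of GD solution:  ", np.linalg.norm(X @ w - y))
print("distance from GD to min-norm sol.: ", np.linalg.norm(w - w_min_norm))
```

Starting from zero, the gradient-descent iterates never leave the row space of `X`, so among the infinitely many interpolants the procedure silently selects the smallest-norm one. The paper's broader point is that analogous, if harder to characterize, norm-minimizing tendencies of SGD are what steer over-parameterized neural networks toward smooth, well-generalizing interpolants.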
Practical and Theoretical Implications
The practical implications of this research are profound:
- Model Selection: It suggests a paradigm shift in how practitioners approach model selection, encouraging the use of highly over-parameterized models coupled with appropriate training techniques.
- Optimization: It highlights that larger models are not only capable of better test performance but are often also easier to optimize, because the empirical risk landscape of heavily over-parameterized models tends to be more amenable to simple local methods such as SGD.
Theoretically, this perspective challenges established beliefs about the necessity of explicit capacity control and regularization, and it motivates new analyses tailored to the properties of over-parameterized, interpolating models.
Future Directions
The findings open several avenues for future research:
- Optimization Dynamics: Further examination of the optimization dynamics in over-parameterized regimes.
- Implicit Bias: A deeper understanding of the implicit biases induced by different training methods.
- Unified Theories: Development of unified theoretical frameworks that better capture the behavior of modern machine learning practices across various model classes.
In conclusion, the paper meaningfully advances our understanding of generalization, bridging the gap between classical machine learning theory and modern empirical success. The double descent risk curve offers a unified account of why highly parameterized models can perform so well, and it is likely to shape both future research and practical methodology in the field.