
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks (2002.11328v3)

Published 26 Feb 2020 in cs.LG and stat.ML

Abstract: The classical bias-variance trade-off predicts that bias decreases and variance increases with model complexity, leading to a U-shaped risk curve. Recent work calls this into question for neural networks and other over-parameterized models, for which it is often observed that larger models generalize better. We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network. We vary the network architecture, loss function, and choice of dataset and confirm that variance unimodality occurs robustly for all models we considered. The risk curve is the sum of the bias and variance curves and displays different qualitative shapes depending on the relative scale of bias and variance, with the double descent curve observed in recent literature as a special case. We corroborate these empirical results with a theoretical analysis of two-layer linear networks with random first layer. Finally, evaluation on out-of-distribution data shows that most of the drop in accuracy comes from increased bias while variance increases by a relatively small amount. Moreover, we find that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data.

Authors (5)
  1. Zitong Yang (10 papers)
  2. Yaodong Yu (39 papers)
  3. Chong You (35 papers)
  4. Jacob Steinhardt (88 papers)
  5. Yi Ma (189 papers)
Citations (167)

Summary

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

The paper "Rethinking Bias-Variance Trade-off for Generalization of Neural Networks" presents a meticulous analysis of the bias-variance trade-off, traditionally perceived in statistical learning theory, particularly in the context of neural networks. The authors delve into an area that has seen discrepancy between classical theoretical predictions and practical observations in modern deep neural network architectures.

Central Thesis

The classical bias-variance trade-off suggests that as model complexity increases, bias tends to decrease while variance increases, leading to a U-shaped risk curve. This theoretical perspective implies an optimal point of model complexity where the balance between bias and variance minimizes prediction error. However, empirical evidence from neural networks contradicts this theory, showing that models with higher complexity, such as deep neural networks, often generalize better than simpler ones.

Key Findings

The authors propose a nuanced explanation for this contradiction by examining the behavior of bias and variance in neural networks. They show that while bias maintains its classical characteristic of monotonic decrease with increased model complexity, variance exhibits a bell-shaped, unimodal behavior: it first increases with model complexity and then decreases. This finding is pivotal because the total risk, as the sum of the two curves, can then take different qualitative shapes depending on the relative scale of bias and variance, rather than the single U-shape the classical picture predicts.

Their experiments span various architectures, loss functions, and datasets, establishing the robustness of their findings across different conditions. Notably, for architectures like ResNet and VGG, the decrease in bias is consistent, while variance demonstrates a non-classical pattern, corroborating the primary claim.

Analysis and Implications

  1. Empirical Analysis: The authors conduct extensive experiments using various neural network architectures, training on datasets like CIFAR10 and MNIST, and evaluating models with mean squared error and cross-entropy loss functions. This comprehensive approach ensures that the observed phenomena are not artifacts of specific conditions or model configurations.
  2. Theoretical Support: A theoretical analysis of bias and variance in a two-layer linear network with a random Gaussian first layer complements the empirical observations (a numerical sketch of this setting follows the list). This model shows that the variance's unimodal behavior can be justified even in simple models, pointing to underlying mathematical principles that govern variance reduction in over-parameterized regimes.
  3. Double Descent Phenomenon: The paper sheds light on the double descent risk curve reported in recent machine learning literature, in which risk decreases, increases, and then decreases again as model complexity crosses the interpolation threshold. By decomposing this curve into a monotonically decreasing bias and a unimodal variance, the authors give the phenomenon a compelling empirical and theoretical basis.
  4. Practical Implications: Understanding these dynamics is crucial for the design and training of machine learning models, especially deep neural networks. Practitioners can leverage larger models with confidence that, despite a temporary rise in variance, they may achieve superior generalization.

Speculation on Future Developments in AI

This paper leads to potential future investigations into regularization techniques and their impact on variance. Given that regularization implicitly helps manage the variance in over-parameterized models, explorations could yield insights into optimizing training regimens and architectures for real-world applications involving deep networks.

Given these insights, further research may target whether these findings apply uniformly across other machine learning paradigms beyond neural networks, possibly influencing model selection and development frameworks in AI.

Conclusion

The paper makes a significant contribution toward reconciling classical statistical learning theory with current empirical practice in neural networks. It challenges researchers to reconsider the bias-variance decomposition and its implications for model generalization, paving the way for models that harness high complexity without compromising predictive accuracy.
