A Modern Take on the Bias-Variance Tradeoff in Neural Networks (1810.08591v4)

Published 19 Oct 2018 in cs.LG and stat.ML

Abstract: The bias-variance tradeoff tells us that as model complexity increases, bias falls and variance increases, leading to a U-shaped test error curve. However, recent empirical results with over-parameterized neural networks are marked by a striking absence of the classic U-shaped test error curve: test error keeps decreasing in wider networks. This suggests that there might not be a bias-variance tradeoff in neural networks with respect to network width, contrary to what was originally claimed by, e.g., Geman et al. (1992). Motivated by the shaky evidence used to support this claim in neural networks, we measure bias and variance in the modern setting. We find that both bias and variance can decrease as the number of parameters grows. To better understand this, we introduce a new decomposition of the variance to disentangle the effects of optimization and data sampling. We also provide theoretical analysis in a simplified setting that is consistent with our empirical findings.

Citations (160)

Summary

  • The paper challenges the classic bias-variance tradeoff view, showing empirical evidence that bias and variance can decrease simultaneously with increasing width in modern neural networks.
  • It introduces a novel variance decomposition method that separates variance due to optimization from variance due to data sampling to better understand this phenomenon.
  • The findings imply that the classic bias-variance tradeoff does not strictly apply to wide neural networks and suggest rethinking model capacity and regularization strategies.

A Modern Take on the Bias-Variance Tradeoff in Neural Networks

The paper "A Modern Take on the Bias-Variance Tradeoff in Neural Networks" challenges the traditional concept of bias-variance tradeoff in the context of over-parameterized neural networks. The standard view suggests that increasing model complexity leads to decreased bias but increased variance, forming a U-shaped curve for test error. However, empirical evidence from contemporary neural networks indicates that test error continues to decrease with network width, a phenomenon at odds with classic bias-variance theory.

Key Findings and Methodology

The authors critically analyze the claim that bias decreases and variance increases with the number of hidden units in neural networks. They find that traditional beliefs, particularly those expressed by Geman et al., do not hold in modern settings. By measuring bias and variance directly in neural networks, instead of relying only on test error analysis, the authors illustrate that both bias and variance can decrease simultaneously as network width increases.
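
For reference, the textbook decomposition the authors revisit can be stated for squared loss as follows (the paper's experiments also cover classification, so take this as the standard reference point rather than their exact estimator): expected test error splits into noise, squared bias, and variance.

```latex
% h_S(x): predictor trained on sample S;  \bar{h}(x) = \mathbb{E}_S[h_S(x)];  \bar{y}(x) = \mathbb{E}[y \mid x]
\mathbb{E}_{x,y,S}\!\left[(h_S(x) - y)^2\right]
  = \underbrace{\mathbb{E}_{x,y}\!\left[(y - \bar{y}(x))^2\right]}_{\text{noise}}
  + \underbrace{\mathbb{E}_{x}\!\left[(\bar{h}(x) - \bar{y}(x))^2\right]}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{x,S}\!\left[(h_S(x) - \bar{h}(x))^2\right]}_{\text{variance}}
```

Estimating the bias and variance terms directly requires training many models on independent training sets at each width, rather than reading them off a single test error curve.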

The paper introduces a novel variance decomposition that isolates the effects of optimization from those of data sampling, giving a deeper view of the observed behavior; the two components, written out formally after the list below, are:

  • Variance due to Optimization: The authors propose that in the over-parameterized regime, variance due to optimization significantly decreases as width increases.
  • Variance due to Sampling: This component grows with width but eventually plateaus, so total variance still decreases in wider networks rather than rising as the classical picture predicts.
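
A minimal way to write this decomposition, assuming the randomness is split into the training sample S and the optimization randomness I (initialization and other stochasticity of training), is the law of total variance; the paper's notation may differ, but the structure is:

```latex
% h_{S,I}(x): predictor trained on sample S with optimization randomness I
\mathrm{Var}_{S,I}\!\big(h_{S,I}(x)\big)
  = \underbrace{\mathbb{E}_{S}\!\big[\mathrm{Var}_{I}\big(h_{S,I}(x) \mid S\big)\big]}_{\text{variance due to optimization}}
  + \underbrace{\mathrm{Var}_{S}\!\big(\mathbb{E}_{I}\big[h_{S,I}(x) \mid S\big]\big)}_{\text{variance due to sampling}}
```

The first term averages, over training sets, how much predictions fluctuate across optimization runs; the second measures how much the seed-averaged predictor moves when the training set is resampled.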

The research combines theoretical analysis with empirical experiments on several datasets (e.g., MNIST, CIFAR10, SVHN). These experiments consistently show that wider networks do not exhibit the increase in prediction variance that conventional wisdom predicts.
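
The authors' experiments train image classifiers on those datasets; as a rough, self-contained illustration of how the two variance components can be estimated, the sketch below instead uses a toy regression problem and scikit-learn's MLPRegressor (an assumption made for brevity, not the paper's setup). Predictions are collected over several data resamples and several optimization seeds, then the decomposition above is applied.

```python
# Illustrative sketch only: a toy regression task, not the paper's image experiments.
# Estimates variance-due-to-optimization and variance-due-to-sampling at each width.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    """Draw a fresh training sample from a fixed sinusoidal ground truth."""
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sin(3.0 * x[:, 0]) + noise * rng.standard_normal(n)
    return x, y

x_test, _ = make_data(200, noise=0.0)     # fixed evaluation points
n_resamples, n_seeds = 5, 5               # data resamples x optimization seeds

for width in (5, 50, 500):
    # preds[s, i, k]: prediction at x_test[k] from a net trained on resample s with seed i
    preds = np.empty((n_resamples, n_seeds, len(x_test)))
    for s in range(n_resamples):
        x_train, y_train = make_data(500)
        for i in range(n_seeds):
            net = MLPRegressor(hidden_layer_sizes=(width,), max_iter=2000, random_state=i)
            net.fit(x_train, y_train)
            preds[s, i] = net.predict(x_test)
    var_opt = preds.var(axis=1).mean()                 # E_S[ Var_I(h | S) ]
    var_samp = preds.mean(axis=1).var(axis=0).mean()   # Var_S( E_I[h | S] )
    print(f"width={width:4d}  var_optimization={var_opt:.4f}  var_sampling={var_samp:.4f}")
```

Swapping in the paper's datasets, architectures, and much larger widths is what lets the authors trace how each component behaves as networks grow.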

Implications and Future Directions

This work implies that the bias-variance tradeoff, a central dogma in machine learning, does not strictly apply to neural networks with large widths. The empirical findings challenge theories based on traditional models and suggest exploring new regularization mechanisms and capacity measures that align with the unique properties of neural networks.

The paper's contribution encourages a reevaluation of model capacity, which might be better understood through probabilistic lenses that account for optimization behavior. Moreover, it hints at practical opportunities in optimizing network design and training processes for better generalization without overfitting in heavily parameterized models.

Future developments should further investigate the theoretical underpinnings of these observations, possibly extending beyond neural networks to other high-capacity models. Additionally, there is potential in studying the dynamics of bias and variance during various phases of network training, providing insights into optimization landscapes shaped by different initializations and learning rates.

Conclusion

In conclusion, the paper provides compelling evidence against the classical view of the bias-variance tradeoff in modern neural networks, painting a complex picture of how network width impacts generalization. By dissecting the components of variance, the authors offer a refined perspective that could drive more effective strategies for leveraging large models in machine learning. This research marks an important step toward reconciling traditional learning theories with the realities observed in contemporary network architectures.