Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond (2411.00247v1)
Abstract: Deep learning sometimes appears to work in unexpected ways. In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network: a sequence of first-order approximations that telescopes into a single, empirically operational tool for practical analysis. Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena in the literature -- including double descent, grokking, linear mode connectivity, and the challenges of applying deep learning to tabular data -- highlighting that this model allows us to construct and extract metrics that help predict and understand the a priori unexpected performance of neural networks. We also demonstrate that this model presents a pedagogical formalism that allows us to isolate components of the training process even in complex contemporary settings, provides a lens to reason about the effects of design choices such as architecture and optimization strategy, and reveals surprising parallels between neural network learning and gradient boosting.
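The telescoping model referenced in the abstract writes a trained network's prediction as its prediction at initialization plus a sum of first-order (linearized) contributions, one per optimizer step. The PyTorch sketch below illustrates that decomposition on a toy regression problem; it is a minimal illustration of the idea under these assumptions rather than the authors' implementation, and names such as `predict_and_grad` and `telescope` are hypothetical.

```python
# Minimal sketch: telescoping first-order approximation of a trained network's
# prediction, accumulated alongside ordinary gradient-descent training.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 1-d regression data and a held-out test input.
X = torch.linspace(-2, 2, 64).unsqueeze(1)
y = torch.sin(2 * X) + 0.1 * torch.randn_like(X)
x_test = torch.tensor([[0.5]])

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
params = list(net.parameters())
opt = torch.optim.SGD(params, lr=0.05)


def predict_and_grad(x):
    """Scalar prediction at x and its gradient w.r.t. the current parameters."""
    out = net(x).squeeze()
    grads = torch.autograd.grad(out, params)
    return out.detach(), [g.detach() for g in grads]


# The telescope starts from the prediction at initialization.
f0, _ = predict_and_grad(x_test)
telescope = f0.clone()

for step in range(500):
    # Gradient of the test prediction at the current parameters,
    # evaluated *before* the optimizer step.
    _, grads = predict_and_grad(x_test)
    theta_before = [p.detach().clone() for p in params]

    # One ordinary training step on the mean squared error.
    opt.zero_grad()
    loss = ((net(X) - y) ** 2).mean()
    loss.backward()
    opt.step()

    # First-order contribution of this step: <grad_theta f(x_test), delta theta>.
    delta = [p.detach() - b for p, b in zip(params, theta_before)]
    telescope = telescope + sum((g * d).sum() for g, d in zip(grads, delta))

print("final network prediction  :", net(x_test).item())
print("telescoping approximation :", telescope.item())
```

With small learning rates the two printed quantities should track each other closely, which is what makes a decomposition of this kind usable as an empirical analysis tool.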
- Alan Jeffares
- Alicia Curth
- Mihaela van der Schaar