Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond (2411.00247v1)

Published 31 Oct 2024 in cs.LG, cs.AI, and stat.ML

Abstract: Deep learning sometimes appears to work in unexpected ways. In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network consisting of a sequence of first-order approximations telescoping out into a single empirically operational tool for practical analysis. Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena in the literature -- including double descent, grokking, linear mode connectivity, and the challenges of applying deep learning on tabular data -- highlighting that this model allows us to construct and extract metrics that help predict and understand the a priori unexpected performance of neural networks. We also demonstrate that this model presents a pedagogical formalism allowing us to isolate components of the training process even in complex contemporary settings, providing a lens to reason about the effects of design choices such as architecture & optimization strategy, and reveals surprising parallels between neural network learning and gradient boosting.

Authors (3)

  1. Alan Jeffares
  2. Alicia Curth
  3. Mihaela van der Schaar

Summary

Insights from Telescoping Models in Deep Learning Phenomena

The paper "Deep Learning Through A Telescoping Lens" by Jeffares, Curth, and van der Schaar introduces a simplified yet effective model for analyzing neural networks, providing empirical insights into several notable deep learning phenomena. The researchers suggest that viewing neural network training as a series of first-order approximations enables the construction of a telescoping model that not only closely approximates the behavior of trained networks but also serves as a tool for understanding complex behaviors observed in practice.

Telescoping Model Overview

The proposed telescoping model reimagines neural network training as a sequence of first-order (linear) approximations, one per training step, rather than treating the network's final parameters as a monolithic endpoint. This telescoping perspective offers a new way of reasoning about a network's trajectory over the course of training, and it yields testable hypotheses about architectures, optimization practices, and empirical behaviors that are difficult to probe from the final weights alone.
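Concretely, writing $\theta_t$ for the parameters after training step $t$, the construction can be sketched as a telescoping sum of per-step output changes, each replaced by its first-order Taylor expansion (our notation, as a minimal formalization consistent with the paper's description):

$$
f(x; \theta_T) = f(x; \theta_0) + \sum_{t=1}^{T} \big( f(x; \theta_t) - f(x; \theta_{t-1}) \big) \approx f(x; \theta_0) + \sum_{t=1}^{T} \nabla_\theta f(x; \theta_{t-1})^\top (\theta_t - \theta_{t-1}).
$$

Unlike a single neural-tangent-kernel linearization around $\theta_0$, the gradient here is re-evaluated at every step, so the approximation can remain accurate even when the tangent kernel drifts substantially during training.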

Key Phenomena Explored

  1. Double Descent Phenomenon: The paper revisits double descent in neural networks, the phenomenon in which test performance improves with model complexity, worsens near the interpolation threshold, and then surprisingly improves again as capacity grows further. Applying the telescoping model, the authors construct a complexity measure $p^{0}_{\hat{\mathbf{s}}}$ that decomposes learned complexity into train-time and test-time components. This decomposition exposes quantifiable differences between the two, clarifying how benign overfitting can occur in overparameterized networks.
  2. Grokking and Generalization: The model also yields insight into grokking, where networks reach perfect training accuracy long before test performance improves. The paper shows that grokking coincides with a marked divergence between the effective parameters used on training data and on test data, providing a quantifiable signature of benign overfitting. This points toward a mechanistic account of how networks discover simpler, generalizable solutions over extended training.
  3. Performance on Tabular Data vs. Gradient Boosting: The paper addresses the frequently reported underperformance of deep learning relative to gradient-boosted trees on tabular data, suggesting that differences in kernel behavior (the implicit neural tangent kernel versus the explicit tree kernel) may account for the gap. In particular, neural networks can behave unexpectedly on irregular inputs, and the maximum value of the kernel at test time offers a predictive signal for such failures.
  4. Linear Mode Connectivity (LMC): On the optimization side, the model helps explain linear mode connectivity, whereby networks trained from the same initialization can have their weights averaged to yield an equally performant model. The telescoping model ties this behavior to a stable regime of the tangent kernel, providing empirical evidence for when and why weight averaging approximates ensembling of the corresponding predictions (see the kernel-alignment sketch after this list).
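As a concrete illustration of the tangent-kernel stability underlying the LMC case study, the sketch below probes how much the empirical tangent kernel changes between two training checkpoints; alignment near 1 indicates the stable regime that the authors associate with the onset of LMC. This is our own minimal PyTorch sketch, not code from the paper, and `ntk_matrix`, `kernel_alignment`, and the checkpoint names are hypothetical:

```python
import torch

def ntk_matrix(model, xs):
    """Empirical tangent kernel K[i, j] = grad_theta f(x_i) . grad_theta f(x_j).

    xs: a batch of inputs of shape (n, ...); each per-example output is summed
    so that a single backward pass yields the full parameter gradient.
    """
    feats = []
    for x in xs:
        model.zero_grad()
        model(x.unsqueeze(0)).sum().backward()
        feats.append(torch.cat([p.grad.detach().flatten()
                                for p in model.parameters()
                                if p.grad is not None]))
    phi = torch.stack(feats)   # (n, num_params) tangent-feature matrix
    return phi @ phi.T         # (n, n) kernel matrix

def kernel_alignment(k1, k2):
    """Cosine similarity between two kernel matrices; values near 1 suggest
    the tangent kernel has stopped moving (a stable / 'lazy' regime)."""
    return ((k1 * k2).sum() / (k1.norm() * k2.norm())).item()

# Hypothetical usage, comparing an early and a late checkpoint on held-out data:
# align = kernel_alignment(ntk_matrix(net_early, x_test), ntk_matrix(net_late, x_test))
# print(f"tangent-kernel alignment: {align:.3f}")
```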

Implications and Future Directions

The implications of this paper are manifold. The telescoping model encourages a reevaluation of neural network complexity and function that accounts for dynamic training phenomena, and the findings suggest potential advances in optimization strategies, architecture design, and the interpretability of neural networks. Future work could extend these insights to broader architectures and training regimes, in particular examining the efficacy of telescoping approximations in large-scale models and across diverse data modalities. By reframing deep learning from a static artifact into a dynamic, comprehensible process, this analytical tool highlights the need for nuanced approaches in both the empirical and theoretical study of AI.

In conclusion, this paper exemplifies a productive methodology for understanding deep learning, offering deeper insight into the established and emerging phenomena that govern model behavior, performance, and generalization. The work lays a solid foundation for future exploration of the often surprising behavior of neural networks across varied contexts.