Neural Redshift: Random Networks are not Random Functions (2403.02241v2)

Published 4 Mar 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD) but they cannot account for the capabilities of models from gradient-free methods nor the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs. Findings. To understand the inductive biases provided by architectures independently from GD, we examine untrained, random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But unlike common wisdom, NNs do not have an inherent "simplicity bias". This property depends on components such as ReLUs, residual connections, and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks. Implications. We provide a fresh explanation for the success of deep learning independent from gradient-based training. It points at promising avenues for controlling the solutions implemented by trained models.

Authors (4)
  1. Damien Teney
  2. Armand Nicolicioiu
  3. Valentin Hartmann
  4. Ehsan Abbasnejad
Citations (11)

Summary

Examining the Inductive Biases of Neural Networks through the Lens of Random-Weight Functions

Introduction

The quest to understand the factors contributing to the generalization capabilities of neural networks (NNs) has led to a considerable body of research. Traditionally, much of this effort has been centered on examining the implicit biases of gradient descent as the primary mechanism of learning. However, recent studies challenge this view, suggesting that other factors intrinsic to the neural architectures might play a role in their ability to generalize from limited data. This paper contributes to this discussion by shifting the focus towards the inherent properties of neural network architectures, independent of the learning algorithm employed.

Inductive Biases in Random-Weight Networks

A pivotal part of our investigation involves the study of neural networks initialized with random weights, henceforth referred to as random-weight networks. Contrary to the common intuition that such networks would behave like random functions, our analyses reveal that even untrained networks exhibit strong inductive biases. These biases manifest as a tendency to represent functions at a particular level of complexity, which does not necessarily align with the notion of "simplicity bias" often attributed to neural networks. Our findings indicate that the complexity preference of a neural network is not a universal trait but is significantly influenced by architectural components such as activation functions, residual connections, and layer normalization.

We employ a variety of complexity measures, including Fourier decomposition, polynomial decomposition, and Lempel-Ziv (LZ) complexity, to rigorously analyze these inductive biases. Through this multi-faceted approach, we find that networks with ReLU activations, and those incorporating residual connections or layer normalization, are biased towards functions of lower complexity, but that this bias towards simplicity is not a foregone conclusion for all architectures; a minimal sketch of such a measurement follows.
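
The sketch below illustrates, under assumed settings (layer widths, grid resolution, the frequency cutoff, and the quantization levels are illustrative choices, not the paper's exact protocol), how one might score the complexity of the function implemented by an untrained MLP with a Fourier spectrum and a compression-based stand-in for LZ complexity:

```python
# Illustrative sketch (not the authors' code): sample an untrained ReLU MLP and
# score the complexity of the function it implements on a 1-D grid.
import numpy as np
import zlib

def random_mlp(widths, activation, seed=0):
    """Untrained MLP with Gaussian weights (He-style scaling), zero biases."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out))
               for fan_in, fan_out in zip(widths[:-1], widths[1:])]
    def f(x):
        h = x
        for i, W in enumerate(weights):
            h = h @ W
            if i < len(weights) - 1:        # no nonlinearity on the output layer
                h = activation(h)
        return h
    return f

relu = lambda z: np.maximum(z, 0.0)
f = random_mlp([1, 256, 256, 1], activation=relu)

# Fourier view: fraction of spectral energy in the lowest frequencies
# (a crude "simplicity" score; the cutoff of 16 is an arbitrary choice).
x = np.linspace(-1.0, 1.0, 1024)[:, None]
y = f(x).ravel()
spectrum = np.abs(np.fft.rfft(y - y.mean())) ** 2
low_freq_fraction = spectrum[:16].sum() / spectrum.sum()

# Compression view: a rough proxy for Lempel-Ziv complexity, obtained by
# quantizing the outputs and measuring how well zlib compresses them.
edges = np.quantile(y, np.linspace(0, 1, 17))[1:-1]
symbols = np.digitize(y, edges).astype(np.uint8)
compression_ratio = len(zlib.compress(symbols.tobytes())) / symbols.nbytes

print(f"low-frequency energy fraction: {low_freq_fraction:.3f}")
print(f"compression ratio (lower = simpler): {compression_ratio:.3f}")
```

Repeating the measurement with different activations or depths is one way to observe the architecture-dependent complexity preferences described above.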

Implications for Deep Learning

Our research provides fresh insights into the success of deep learning, suggesting that it is not solely reliant on gradient-based optimization methods. By elucidating how certain architectural choices predispose networks towards functions of a particular complexity, we unveil avenues for controlling the generalization behavior of trained models. This understanding underscores the importance of architectural design in deep learning and challenges the conventional wisdom surrounding the role of gradient descent in the generalization capabilities of neural networks.

Towards a Future of Tailored Complexity Bias

The observation that uniform sampling in a network's parameter space yields functions of a characteristic complexity opens up the potential for deliberately manipulating these biases to suit specific tasks. By adjusting architectural elements such as activation functions and the magnitude of the weights, we demonstrate that it is feasible to modulate the complexity bias of a network, as sketched below. This capability to tailor the inductive bias of neural networks could prove instrumental in tackling tasks where a mismatch exists between the complexity of the target function and the inherent bias of the network architecture.
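
As an illustration of this idea, the following hedged sketch (the activation choices, weight scales, and low-frequency score are assumptions made for the example, not the paper's exact setup) sweeps the activation function and a weight-scale multiplier of an untrained MLP and reports how much of the resulting function's spectral energy sits at low frequencies:

```python
# Illustrative sketch: how activation choice and weight magnitude shift the
# complexity of functions implemented by untrained MLPs.
import numpy as np

def sample_mlp(widths, activation, weight_scale=1.0, seed=0):
    """Untrained MLP; weight_scale rescales every weight matrix."""
    rng = np.random.default_rng(seed)
    weights = [weight_scale * rng.normal(0.0, np.sqrt(2.0 / i), (i, o))
               for i, o in zip(widths[:-1], widths[1:])]
    def f(x):
        h = x
        for k, W in enumerate(weights):
            h = h @ W
            if k < len(weights) - 1:
                h = activation(h)
        return h
    return f

def low_freq_fraction(f, cutoff=16, n=1024):
    """Share of spectral energy below the cutoff frequency on a 1-D grid."""
    x = np.linspace(-1.0, 1.0, n)[:, None]
    y = f(x).ravel()
    p = np.abs(np.fft.rfft(y - y.mean())) ** 2
    return p[:cutoff].sum() / p.sum()

activations = {
    "relu": lambda z: np.maximum(z, 0.0),   # typically lower-frequency functions
    "tanh": np.tanh,
    "sine": np.sin,                         # typically higher-frequency functions
}
for name, act in activations.items():
    for scale in (0.5, 1.0, 2.0):
        net = sample_mlp([1, 256, 256, 1], act, weight_scale=scale)
        print(f"{name:5s} scale={scale:3.1f} "
              f"low-freq fraction={low_freq_fraction(net):.3f}")
```

In this toy setup one would expect larger weight scales and oscillatory activations to push spectral energy towards higher frequencies, consistent with the tailoring of complexity bias described above.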

Relevance to Transformer Models

In extending our analysis to transformer-based sequence models, we observe that transformers inherit the complexity biases of their constituent components. This realization not only reinforces the importance of architectural considerations in the design of neural models but also offers a fresh perspective on the observed tendencies of transformers, such as their predilection for generating simple, repetitive sequences.
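
One hedged way to probe this empirically is sketched below, assuming the Hugging Face transformers library; the model size, sampling settings, and repetition metrics are illustrative assumptions rather than the paper's protocol. It samples token sequences from a GPT-2 architecture with freshly initialized, untrained weights and measures how repetitive and compressible they are:

```python
# Illustrative sketch: generate from an *untrained* GPT-2-style model and
# quantify the repetitiveness/compressibility of its output sequences.
import zlib
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)
config = GPT2Config(n_layer=4, n_head=4, n_embd=256)   # small, randomly initialized
model = GPT2LMHeadModel(config).eval()                  # no pretraining

prompt = torch.tensor([[config.bos_token_id]])
with torch.no_grad():
    out = model.generate(prompt, do_sample=True, temperature=1.0, top_k=0,
                         max_new_tokens=256, pad_token_id=config.eos_token_id)

tokens = out[0].tolist()
bigrams = list(zip(tokens, tokens[1:]))
distinct_unigrams = len(set(tokens)) / len(tokens)
distinct_bigrams = len(set(bigrams)) / len(bigrams)

# Compression ratio of the raw token stream as a crude sequence-complexity proxy.
raw = b"".join(t.to_bytes(4, "little") for t in tokens)
compression_ratio = len(zlib.compress(raw)) / len(raw)

print(f"distinct unigram ratio: {distinct_unigrams:.2f}")
print(f"distinct bigram ratio:  {distinct_bigrams:.2f}")
print(f"compression ratio:      {compression_ratio:.2f}")
```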

Conclusion

In sum, this work takes significant strides in broadening our comprehension of the factors that drive the generalization abilities of neural networks. By focusing on the intrinsic biases of neural architectures, independently of the specifics of the optimization process, we provide a nuanced understanding of why certain architectural configurations excel in practice. The implications of our findings extend beyond theoretical interest, offering practical guidance for the design of neural networks tailored to the complexities of the tasks they are intended to solve.
