
Scaling MLPs: A Tale of Inductive Bias (2306.13575v3)

Published 23 Jun 2023 in cs.LG

Abstract: In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative "less inductive bias is better", popularized due to transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, as they lack any vision-specific inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects. We show that the performance of MLPs drastically improves with scale (95% on CIFAR10, 82% on CIFAR100, 58% on ImageNet ReaL), highlighting that lack of inductive bias can indeed be compensated. We observe that MLPs mimic the behaviour of their modern counterparts faithfully, with some components in the learning setting however exhibiting stronger or unexpected behaviours. Due to their inherent computational efficiency, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.
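To make the setup concrete, below is a minimal PyTorch sketch of the kind of model the abstract describes: a plain MLP applied to flattened images, with no convolutions, patches, or attention, so no vision-specific inductive bias. The width, depth, and activation here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed hyperparameters, not the paper's exact architecture):
# a plain MLP on flattened CIFAR-10-sized images. Flattening discards all
# spatial structure, which is what "no vision-specific inductive bias" means.
import torch
import torch.nn as nn

class PlainMLP(nn.Module):
    def __init__(self, in_dim=3 * 32 * 32, width=1024, depth=6, num_classes=10):
        super().__init__()
        layers = [nn.Flatten()]                     # image -> unstructured vector
        dims = [in_dim] + [width] * depth
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.GELU()]
        layers.append(nn.Linear(width, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                           # x: (batch, 3, 32, 32)
        return self.net(x)

model = PlainMLP()
logits = model(torch.randn(8, 3, 32, 32))
print(logits.shape)  # torch.Size([8, 10])
```

Scaling such a model, per the abstract, amounts to increasing `width` and `depth` and pre-training on larger datasets; the paper's reported gains (e.g. 95% on CIFAR10) come from that scaling, not from adding architectural priors.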

Authors (3)
  1. Gregor Bachmann (21 papers)
  2. Sotiris Anagnostidis (21 papers)
  3. Thomas Hofmann (121 papers)
Citations (32)
