Optimal Nonlinearities Improve Generalization Performance of Random Features (2309.16846v1)

Published 28 Sep 2023 in cs.LG and stat.ML

Abstract: The random feature model with a nonlinear activation function has been shown to be asymptotically equivalent to a Gaussian model in terms of training and generalization errors. Analysis of the equivalent model reveals an important yet not fully understood role played by the activation function. To address this issue, we study the "parameters" of the equivalent model to achieve improved generalization performance for a given supervised learning problem. We show that the parameters acquired from the Gaussian model enable us to define a set of optimal nonlinearities. We provide two example classes from this set, namely second-order polynomial and piecewise linear functions. These functions are optimized to improve generalization performance regardless of their actual form. We experiment with regression and classification problems, including synthetic and real (e.g., CIFAR10) data. Our numerical results validate that the optimized nonlinearities achieve better generalization performance than widely used nonlinear functions such as ReLU. Furthermore, we illustrate that the proposed nonlinearities also mitigate the so-called double descent phenomenon, i.e., the non-monotonic behavior of the generalization error with respect to the sample size and the model size.
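To make the setting concrete, below is a minimal sketch (Python/NumPy, chosen here for illustration) of random-features ridge regression with interchangeable pointwise nonlinearities, comparing ReLU with a second-order polynomial activation of the kind the paper optimizes. The synthetic data model, the ridge regularization strength, and the polynomial coefficients are illustrative assumptions, not the optimized parameters derived from the Gaussian equivalent model in the paper.

```python
# Sketch of random-features ridge regression with a configurable pointwise
# nonlinearity. All hyperparameters below are placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W, nonlinearity):
    """Map inputs X (n x d) through random projections W (d x p) and a pointwise nonlinearity."""
    return nonlinearity(X @ W)

relu = lambda z: np.maximum(z, 0.0)
# Second-order polynomial nonlinearity sigma(z) = a*z + b*(z^2 - 1).
# The coefficients (a, b) are hand-picked placeholders, not the paper's optimal values.
poly2 = lambda z, a=1.0, b=0.1: a * z + b * (z**2 - 1.0)

def ridge_fit_predict(F_train, y_train, F_test, lam=1e-2):
    """Closed-form ridge regression in the random-feature space."""
    p = F_train.shape[1]
    coef = np.linalg.solve(F_train.T @ F_train + lam * np.eye(p), F_train.T @ y_train)
    return F_test @ coef

# Synthetic linear teacher data (an assumption for demonstration only).
n, d, p = 400, 100, 200
X_train, X_test = rng.standard_normal((n, d)), rng.standard_normal((n, d))
beta = rng.standard_normal(d) / np.sqrt(d)
y_train, y_test = X_train @ beta, X_test @ beta

W = rng.standard_normal((d, p)) / np.sqrt(d)  # shared random projection matrix
for name, sigma in [("relu", relu), ("poly2", poly2)]:
    F_tr = random_features(X_train, W, sigma)
    F_te = random_features(X_test, W, sigma)
    y_hat = ridge_fit_predict(F_tr, y_train, F_te)
    print(name, "test MSE:", np.mean((y_hat - y_test) ** 2))
```

In the paper's approach, the coefficients of such a polynomial (or the slopes and breakpoints of a piecewise linear activation) would be set from the parameters of the Gaussian equivalent model for the given learning problem, rather than fixed by hand as in this sketch.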
