Function-Space Optimality of Neural Architectures With Multivariate Nonlinearities (2310.03696v2)

Published 5 Oct 2023 in stat.ML and cs.LG

Abstract: We investigate the function-space optimality (specifically, the Banach-space optimality) of a large class of shallow neural architectures with multivariate nonlinearities/activation functions. To that end, we construct a new family of Banach spaces defined via a regularization operator, the $k$-plane transform, and a sparsity-promoting norm. We prove a representer theorem that states that the solution sets to learning problems posed over these Banach spaces are completely characterized by neural architectures with multivariate nonlinearities. These optimal architectures have skip connections and are tightly connected to orthogonal weight normalization and multi-index models, both of which have received recent interest in the neural network community. Our framework is compatible with a number of classical nonlinearities including the rectified linear unit (ReLU) activation function, the norm activation function, and the radial basis functions found in the theory of thin-plate/polyharmonic splines. We also show that the underlying spaces are special instances of reproducing kernel Banach spaces and variation spaces. Our results shed light on the regularity of functions learned by neural networks trained on data, particularly with multivariate nonlinearities, and provide new theoretical motivation for several architectural choices found in practice.
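
As a rough illustration of the two ingredients named in the abstract (the notation $\mathcal{R}_k$, $U$, $t$, $v_n$, $A_n$, $b_n$, $\rho$, $N$, $c$ is illustrative and not taken from the paper), the $k$-plane transform of a function $f: \mathbb{R}^d \to \mathbb{R}$ integrates $f$ over $k$-dimensional affine planes,

$$\mathcal{R}_k f(U, t) = \int_{\mathbb{R}^k} f(U y + t)\,\mathrm{d}y, \qquad U \in \mathbb{R}^{d \times k} \text{ with orthonormal columns},\ \ t \perp \operatorname{range}(U),$$

and the representer theorem asserts, loosely, that learning problems regularized through such an operator together with a sparsity-promoting norm admit solutions of a shallow form along the lines of

$$f(x) = \sum_{n=1}^{N} v_n\, \rho\!\left(A_n x - b_n\right) + c(x),$$

where $\rho$ is a multivariate nonlinearity (for instance the norm activation $\rho(z) = \lVert z \rVert_2$ or a radial basis function of the thin-plate type), the weight matrices $A_n$ have orthonormal rows (the point of contact with orthogonal weight normalization and multi-index models), and $c$ is a low-degree polynomial skip term. This display is a schematic reading of the abstract, not the paper's exact statement.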
