Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime (2405.15254v1)

Published 24 May 2024 in stat.ML, cs.AI, and cs.LG

Abstract: This paper presents two models of neural networks and their training, applicable to neural networks of arbitrary width, depth and topology, assuming only finite-energy neural activations, together with a novel representor theory for neural networks in terms of a matrix-valued kernel. The first model is exact (un-approximated) and global, casting the neural network as an element of a reproducing kernel Banach space (RKBS); we use this model to provide tight bounds on Rademacher complexity. The second model is exact and local, casting the change in the neural network function resulting from a bounded change in weights and biases (i.e., a training step) in a reproducing kernel Hilbert space (RKHS) in terms of a local-intrinsic neural kernel (LiNK). This local model provides insight into model adaptation through tight bounds on the Rademacher complexity of network adaptation. We also prove that the neural tangent kernel (NTK) is a first-order approximation of the LiNK kernel. Finally, noting that the LiNK does not provide a representor theory for technical reasons, we present an exact novel representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK). This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural network kernel models. Throughout the paper, (a) feedforward ReLU networks and (b) residual networks (ResNet) are used as illustrative examples.
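The paper's LiNK and LeNK constructions are not reproduced here, but the NTK that the abstract relates them to (as a first-order approximation of the LiNK) is straightforward to compute empirically for a small feedforward ReLU network. The sketch below is illustrative only: the network sizes, initialization, and helper names (`init_params`, `forward`, `empirical_ntk`) are assumptions, not the paper's code; it simply forms the Gram matrix of parameter gradients, K(x, x') = ⟨∇θ f(x), ∇θ f(x')⟩, for a toy scalar-output network.

```python
import jax
import jax.numpy as jnp

# Toy feedforward ReLU network with scalar output (illustrative sizes).
def init_params(key, sizes=(3, 16, 16, 1)):
    params = []
    for m, n in zip(sizes[:-1], sizes[1:]):
        key, wkey = jax.random.split(key)
        params.append((jax.random.normal(wkey, (n, m)) / jnp.sqrt(m),
                       jnp.zeros(n)))
    return params

def forward(params, x):
    h = x
    for W, b in params[:-1]:
        h = jax.nn.relu(W @ h + b)
    W, b = params[-1]
    return (W @ h + b)[0]  # scalar output f(theta; x)

def empirical_ntk(params, xs):
    # Gradient of the output with respect to all weights and biases, per input.
    grads = jax.vmap(lambda x: jax.grad(forward)(params, x))(xs)
    # Flatten each per-input gradient pytree into a single vector.
    flat = jax.vmap(lambda g: jnp.concatenate(
        [jnp.ravel(leaf) for leaf in jax.tree_util.tree_leaves(g)]))(grads)
    # Empirical NTK Gram matrix: inner products of parameter gradients.
    return flat @ flat.T

key = jax.random.PRNGKey(0)
params = init_params(key)
xs = jax.random.normal(key, (5, 3))
K = empirical_ntk(params, xs)
print(K.shape)  # (5, 5) symmetric positive semi-definite matrix
```

At finite width this empirical NTK changes as the parameters move during training; the abstract's point is that it is only the first-order term of the exact local (LiNK) description of such a training step.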


