Approximation and Gradient Descent Training with Neural Networks (2405.11696v1)
Published 19 May 2024 in cs.LG
Abstract: It is well understood that neural networks with carefully hand-picked weights provide powerful function approximation and that they can be successfully trained in over-parametrized regimes. Since over-parametrization ensures zero training error, these two theories are not immediately compatible. Recent work uses the smoothness required for approximation results to extend a neural tangent kernel (NTK) optimization argument to an under-parametrized regime and to show direct approximation bounds for networks trained by gradient flow. Since gradient flow is only an idealization of a practical method, this paper establishes analogous results for networks trained by gradient descent.
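The abstract turns on the distinction between gradient flow and gradient descent. As standard background (not a statement taken from the paper), gradient flow is the continuous-time idealization of training, and gradient descent with step size $h > 0$ is its explicit Euler discretization:

$$\dot{\theta}(t) = -\nabla L(\theta(t)), \qquad \theta(0) = \theta_0 \qquad \text{(gradient flow)},$$

$$\theta_{k+1} = \theta_k - h\,\nabla L(\theta_k), \qquad k = 0, 1, 2, \ldots \qquad \text{(gradient descent)}.$$

As $h \to 0$, the gradient descent iterates approach the gradient flow trajectory; proving approximation bounds directly for the discrete iteration, rather than for its continuous idealization, is the contribution the abstract describes.

For concreteness, the sketch below trains a shallow ReLU network on a 1d target by plain gradient descent. It is a minimal illustration of this training setup under assumed choices (least-squares loss, random Gaussian initialization, hand-picked width and step size); it is not the paper's construction or parameter scaling.

```python
# Minimal sketch: shallow ReLU network f(x) = sum_j a_j * relu(w_j * x + b_j)
# trained on a 1d target by gradient descent (explicit Euler steps of gradient flow).
import numpy as np

rng = np.random.default_rng(0)
m = 200                              # network width (assumed, not from the paper)
h = 1e-2                             # gradient descent step size
x = np.linspace(-1.0, 1.0, 100)      # training samples
y = np.sin(np.pi * x)                # target function values

w = rng.normal(size=m)               # inner weights
b = rng.normal(size=m)               # biases
a = rng.normal(size=m) / np.sqrt(m)  # outer weights

def forward(x):
    pre = np.outer(x, w) + b         # pre-activations, shape (n_samples, m)
    act = np.maximum(pre, 0.0)       # ReLU
    return act @ a, act, pre

n = len(x)
for k in range(5000):
    pred, act, pre = forward(x)
    res = pred - y                   # residual of L = (1/2n) * sum_i (f(x_i) - y_i)^2
    grad_a = act.T @ res / n
    dact = (pre > 0) * (res[:, None] * a[None, :])
    grad_w = (dact * x[:, None]).sum(axis=0) / n
    grad_b = dact.sum(axis=0) / n
    # One step: theta_{k+1} = theta_k - h * grad L(theta_k)
    a -= h * grad_a
    w -= h * grad_w
    b -= h * grad_b
```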