Optimal Learning (2203.15994v2)

Published 30 Mar 2022 in cs.LG, cs.NA, math.NA, and stat.ML

Abstract: This paper studies the problem of learning an unknown function $f$ from given data about $f$. The learning problem is to give an approximation $\hat f$ to $f$ that predicts the values of $f$ away from the data. There are numerous settings for this learning problem depending on (i) what additional information we have about $f$ (known as a model class assumption), (ii) how we measure the accuracy with which $\hat f$ predicts $f$, (iii) what is known about the data and data sites, and (iv) whether the data observations are polluted by noise. A mathematical description of the optimal performance possible (the smallest possible error of recovery) is known in the presence of a model class assumption. Under standard model class assumptions, it is shown in this paper that a near-optimal $\hat f$ can be found by solving a certain discrete over-parameterized optimization problem with a penalty term. Here, near-optimal means that the error is bounded by a fixed constant times the optimal error. This explains the advantage of over-parameterization, which is commonly used in modern machine learning. The main results of this paper prove that over-parameterized learning with an appropriate loss function gives a near-optimal approximation $\hat f$ of the function $f$ from which the data is collected. Quantitative bounds are given for how much over-parameterization needs to be employed and how the penalization needs to be scaled in order to guarantee a near-optimal recovery of $f$. An extension of these results to the case where the data is polluted by additive deterministic noise is also given.
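The abstract does not spell out the form of the penalized over-parameterized problem, so the following is only an illustrative sketch of a lasso-type formulation that fits the description. Here $(x_i, y_i)_{i=1}^m$ denote the data, $\{\phi_j\}_{j=1}^N$ an over-parameterized dictionary with $N \gg m$, and $\lambda > 0$ the penalty parameter; the specific dictionary, loss, and penalty shown are assumptions for illustration, not the paper's exact choices:

$$
\hat a \in \operatorname*{arg\,min}_{a \in \mathbb{R}^N}
\Bigg( \frac{1}{m} \sum_{i=1}^{m} \Big( y_i - \sum_{j=1}^{N} a_j \, \phi_j(x_i) \Big)^2 \Bigg)^{1/2}
+ \lambda \sum_{j=1}^{N} |a_j|,
\qquad
\hat f := \sum_{j=1}^{N} \hat a_j \, \phi_j .
$$

In such a reading, the paper's quantitative bounds would govern how large $N$ must be taken and how $\lambda$ must be scaled with the number of data sites $m$ so that the recovery error of $\hat f$ stays within a fixed constant of the optimal error for the model class.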

