Optimal Learning (2203.15994v2)
Abstract: This paper studies the problem of learning an unknown function $f$ from given data about $f$. The learning problem is to give an approximation $\hat f$ to $f$ that predicts the values of $f$ away from the data. There are numerous settings for this learning problem depending on (i) what additional information we have about $f$ (known as a model class assumption), (ii) how we measure how well $\hat f$ predicts $f$, (iii) what is known about the data and data sites, and (iv) whether the data observations are polluted by noise. A mathematical description of the optimal performance possible (the smallest possible error of recovery) is known in the presence of a model class assumption. Under standard model class assumptions, it is shown in this paper that a near optimal $\hat f$ can be found by solving a certain discrete over-parameterized optimization problem with a penalty term. Here, near optimal means that the error is bounded by a fixed constant times the optimal error. This explains the advantage of over-parameterization, which is commonly used in modern machine learning. The main results of this paper prove that over-parameterized learning with an appropriate loss function gives a near optimal approximation $\hat f$ of the function $f$ from which the data is collected. Quantitative bounds are given for how much over-parameterization needs to be employed and how the penalization needs to be scaled in order to guarantee a near optimal recovery of $f$. An extension of these results to the case where the data is polluted by additive deterministic noise is also given.
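To make the shape of the procedure described in the abstract concrete, here is a minimal sketch of learning $\hat f$ by an over-parameterized optimization with a penalty term. The specific choices below are assumptions made for illustration, not the paper's construction: a dictionary of random ReLU features with many more terms than data points, a squared data-misfit loss with an $\ell^1$ penalty, and a plain proximal-gradient (ISTA) solver. The penalty scale `lam` is likewise only a placeholder for the scaling the paper quantifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: m noisy samples of an unknown univariate function f on [0, 1].
m = 50
x = rng.uniform(0.0, 1.0, size=m)
f = lambda t: np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)
y = f(x) + 0.01 * rng.standard_normal(m)            # small additive noise

# Over-parameterization: n >> m random ReLU features  t -> max(0, w*t + b).
n = 2000
w = rng.standard_normal(n)
b = rng.uniform(-1.0, 1.0, size=n)
Phi = np.maximum(0.0, np.outer(x, w) + b)            # m x n design matrix

# Penalized fit (illustrative loss/penalty, not the paper's exact one):
#   min_a  (1/2m) ||Phi a - y||_2^2 + lam * ||a||_1,  solved by ISTA.
lam = 1e-3                                           # assumed penalty scale
step = 1.0 / (np.linalg.norm(Phi, 2) ** 2 / m)       # 1 / Lipschitz constant
a = np.zeros(n)
for _ in range(2000):
    grad = Phi.T @ (Phi @ a - y) / m                 # gradient of the misfit
    a = a - step * grad
    a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)  # soft threshold

# The learned approximation f_hat, evaluated away from the data sites.
f_hat = lambda t: np.maximum(0.0, np.outer(np.atleast_1d(t), w) + b) @ a
t_test = np.linspace(0.0, 1.0, 200)
print("max error on test grid:", np.max(np.abs(f_hat(t_test) - f(t_test))))
```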