Trade-off Between Dependence and Complexity for Nonparametric Learning -- an Empirical Process Approach

arXiv:2401.08978
Published Jan 17, 2024 in math.ST, stat.ML, and stat.TH

Abstract

Empirical process theory for i.i.d. observations has emerged as a ubiquitous tool for understanding the generalization properties of various statistical problems. However, in many applications where the data exhibit temporal dependencies (e.g., in finance, medical imaging, weather forecasting, etc.), the corresponding empirical processes are much less understood. Motivated by this observation, we present a general bound on the expected supremum of empirical processes under standard $\beta/\rho$-mixing assumptions. Unlike most prior work, our results cover both the long-range and the short-range regimes of dependence. Our main result shows that a non-trivial trade-off between the complexity of the underlying function class and the dependence among the observations characterizes the learning rate in a large class of nonparametric problems. This trade-off reveals a new phenomenon, namely that even under long-range dependence, it is possible to attain the same rates as in the i.i.d. setting, provided the underlying function class is complex enough. We demonstrate the practical implications of our findings by analyzing various statistical estimators in both fixed and growing dimensions. Our main examples include a comprehensive case study of generalization error bounds in nonparametric regression over smoothness classes in fixed as well as growing dimension using neural nets, shape-restricted multivariate convex regression, estimating the optimal transport (Wasserstein) distance between two probability distributions, and classification under the Mammen-Tsybakov margin condition -- all under appropriate mixing assumptions. In the process, we also develop bounds on $L_r$ ($1\le r\le 2$)-localized empirical processes with dependent observations, which we then leverage to obtain faster rates for (a) tuning-free adaptation, and (b) set-structured learning problems.
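
As a reading aid, here is a minimal sketch of the kind of mechanism that usually underlies such results; it is not taken from the paper, and the block length $a$, mixing coefficients $\beta(\cdot)$, function class $\mathcal F$ with bounded envelope $F$, and coupled blocks $\tilde B_j$ below are notation introduced here purely for illustration. Classically, one splits the stationary $\beta$-mixing sample $X_1,\dots,X_n$ into consecutive blocks of length $a$ and uses a Berbee-type coupling to replace (roughly every other) block by an independent copy, at a total-variation cost of order $(n/a)\,\beta(a)$. Up to constants, this yields a generic decomposition of the form

$$
\mathbb{E}\Bigg[\sup_{f\in\mathcal F}\bigg|\frac{1}{n}\sum_{i=1}^{n}\big(f(X_i)-\mathbb{E}f(X_1)\big)\bigg|\Bigg]
\;\lesssim\;
\mathbb{E}\Bigg[\sup_{f\in\mathcal F}\bigg|\frac{a}{n}\sum_{j=1}^{\lfloor n/a\rfloor}\big(\bar f(\tilde B_j)-\mathbb{E}f(X_1)\big)\bigg|\Bigg]
\;+\;\|F\|_{\infty}\,\frac{n}{a}\,\beta(a),
$$

where $\tilde B_1,\tilde B_2,\dots$ are independent blocks distributed like the original ones and $\bar f(\tilde B_j)$ denotes the average of $f$ over the $a$ observations in $\tilde B_j$. The first term is an i.i.d.-type empirical process built from roughly $n/a$ effective observations; the second is the price of dependence. Heuristically, this already hints at the trade-off described in the abstract: for a sufficiently complex class the first term decays slowly in the effective sample size, so choosing $a$ just large enough to make the coupling error negligible loses little, and the i.i.d. rate can persist even under long-range dependence. The paper's actual results are sharper than this sketch and also cover $\rho$-mixing as well as $L_r$-localized versions of the process.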

