Minimax optimality of deep neural networks on dependent data via PAC-Bayes bounds (2410.21702v2)

Published 29 Oct 2024 in stat.ML and cs.LG

Abstract: In a groundbreaking work, Schmidt-Hieber (2020) proved the minimax optimality of deep neural networks with ReLU activation for least-square regression estimation over a large class of functions defined by composition. In this paper, we extend these results in many directions. First, we remove the i.i.d. assumption on the observations, to allow some time dependence. The observations are assumed to be a Markov chain with a non-null pseudo-spectral gap. Then, we study a more general class of machine learning problems, which includes least-square and logistic regression as special cases. Leveraging PAC-Bayes oracle inequalities and a version of Bernstein's inequality due to Paulin (2015), we derive upper bounds on the estimation risk for a generalized Bayesian estimator. In the case of least-square regression, this bound matches (up to a logarithmic factor) the lower bound of Schmidt-Hieber (2020). We establish a similar lower bound for classification with the logistic loss, and prove that the proposed DNN estimator is optimal in the minimax sense.

References (40)
  1. Alquier, P. User-friendly introduction to PAC-Bayes bounds. Foundations and Trends® in Machine Learning 17, 2 (2024), 174–303.
  2. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modeling 1 (2013), 65–93.
  3. Model selection for weakly dependent time series forecasting. Bernoulli 18, 3 (2012), 883–913.
  4. PAC-Bayes bounds on variational tempered posteriors for Markov models. Entropy 23, 3 (2021), 313.
  5. Convexity, classification, and risk bounds. Journal of the American Statistical Association 101, 473 (2006), 138–156.
  6. Deep learning. MIT Press, Cambridge, MA, USA, 2017.
  7. Castillo, I. Bayesian nonparametric statistics, St-Flour lecture notes. arXiv preprint arXiv:2402.16422 (2024).
  8. Posterior and variational inference for deep neural networks with heavy-tailed weights. arXiv preprint arXiv:2406.03369 (2024).
  9. Deep Horseshoe Gaussian Processes. arXiv preprint arXiv:2403.01737 (2024).
  10. Catoni, O. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes – Monograph Series, 56. Institute of Mathematical Statistics, Beachwood, OH, 2007.
  11. Chérief-Abdellatif, B.-E. Convergence rates of variational inference in sparse deep learning. In International Conference on Machine Learning (2020), PMLR, pp. 1831–1842.
  12. A PAC-Bayes bound for deterministic classifiers. arXiv preprint arXiv:2209.02525 (2022).
  13. Wide stochastic networks: Gaussian limit and PAC-Bayesian training. In International Conference on Algorithmic Learning Theory (2023), PMLR, pp. 447–470.
  14. Markov Chains. Springer, 2018.
  15. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (2017).
  16. Hold-out estimates of prediction models for Markov processes. Statistics 57, 2 (2023), 458–481.
  17. Generalization bounds: Perspectives from information theory and PAC-Bayes. arXiv preprint arXiv:2309.04381 (2023).
  18. Mixing time estimation in reversible Markov chains from a single sample path. The Annals of Applied Probability 29, 4 (2019), 2439–2480.
  19. Kengne, W. Excess risk bound for deep learning under weak dependence. arXiv preprint arXiv:2302.07503 (2023).
  20. Penalized deep neural networks estimator with general loss functions under weak dependence. arXiv preprint arXiv:2305.06230 (2023).
  21. Deep learning for ψ-weakly dependent processes. Journal of Statistical Planning and Inference (2024), 106163.
  22. Sparse-penalized deep neural networks estimator under weak dependence. Metrika (2024), 1–32.
  23. An optimization-centric view on Bayes’ rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research 23, 132 (2022), 1–109.
  24. On the rate of convergence of a deep recurrent neural network estimate in a regression problem with dependent data. Bernoulli 29, 2 (2023), 1663–1685.
  25. Adaptive deep learning for nonparametric time series regression. arXiv preprint arXiv:2207.02546 (2022).
  26. Estimating the spectral gap of a reversible Markov chain from a short trajectory. arXiv preprint arXiv:1612.05330 (2016).
  27. Mai, T. T. Misclassification bounds for PAC-Bayesian sparse deep learning. arXiv preprint arXiv:2405.01304 (2024).
  28. McAllester, D. A. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (New York, 1998), ACM, pp. 230–234.
  29. Smooth function approximation by deep neural networks with general activation functions. Entropy 21, 7 (2019), 627.
  30. Nonconvex sparse regularization for deep neural networks and its optimality. Neural Computation 34, 2 (2022), 476–517.
  31. Paulin, D. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability 20 (2015), 1–32.
  32. Tighter risk certificates for neural networks. The Journal of Machine Learning Research 22, 1 (2021), 10326–10365.
  33. Posterior concentration for sparse deep learning. Advances in Neural Information Processing Systems 31 (2018).
  34. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics 48, 4 (2020), 1875–1897.
  35. PAC-Bayes training for neural networks: sparsity and uncertainty quantification. arXiv preprint arXiv:2204.12392 (2022).
  36. Learning sparse deep neural networks with a spike-and-slab prior. Statistics & Probability Letters 180 (2022), 109246.
  37. Suzuki, T. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. 7th International Conference on Learning Representations (ICLR) (2019).
  38. Tsybakov, A. B. Introduction to nonparametric estimation. Springer Ser. Stat. New York, NY: Springer, 2009.
  39. Improved estimation of relaxation time in nonreversible Markov chains. The Annals of Applied Probability 34, 1A (2024), 249–276.
  40. Classification with deep neural networks and logistic loss. Journal of Machine Learning Research 25, 125 (2024), 1–117.

Summary

  • The paper relaxes the i.i.d. assumption by examining DNNs on Markov chain data with a non-null pseudo-spectral gap.
  • The methodology leverages PAC-Bayes oracle inequalities and a Bernstein inequality for Markov chains (Paulin, 2015) to derive near-optimal risk bounds.
  • The findings show that the proposed DNN estimators achieve minimax rates, up to logarithmic factors, for both least-square regression and logistic classification with dependent data.

Minimax Optimality of Deep Neural Networks on Dependent Data via PAC-Bayes Bounds

This paper presents a significant extension of the theoretical understanding of deep neural networks (DNNs), focusing on scenarios where the observations are not independent and identically distributed (i.i.d.) but instead arise from a Markov chain with a non-null pseudo-spectral gap. The work builds on Schmidt-Hieber (2020), which established minimax optimality of DNNs with ReLU activation functions for least-square regression under i.i.d. data.
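
To make the dependence assumption concrete: for a finite-state chain with transition matrix P and stationary distribution π, Paulin (2015) defines the pseudo-spectral gap as γ_ps = max_{k ≥ 1} γ((P*)^k P^k) / k, where P* is the time reversal of P and γ(·) denotes the spectral gap of a reversible kernel. The snippet below is a minimal numerical sketch of this definition, not taken from the paper; the function names and the truncation at k_max are illustrative choices.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic finite-state chain:
    the left eigenvector of P for eigenvalue 1, normalised to sum to 1."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def pseudo_spectral_gap(P, k_max=50):
    """Pseudo-spectral gap of Paulin (2015), truncated at k_max:
    gamma_ps = max_{1 <= k <= k_max} gamma((P*)^k P^k) / k,
    where P* is the time reversal of P and gamma(.) the spectral gap."""
    pi = stationary_distribution(P)
    D, D_inv = np.diag(pi), np.diag(1.0 / pi)
    P_star = D_inv @ P.T @ D                  # time reversal: P*(x, y) = pi(y) P(y, x) / pi(x)
    n = P.shape[0]
    Pk, Psk, best = np.eye(n), np.eye(n), 0.0
    for k in range(1, k_max + 1):
        Pk, Psk = Pk @ P, Psk @ P_star
        M = Psk @ Pk                          # reversible w.r.t. pi, real eigenvalues in [0, 1]
        lam = np.sort(np.real(np.linalg.eigvals(M)))[::-1]
        best = max(best, (1.0 - lam[1]) / k)  # spectral gap of M, divided by k
    return best

# Example: a "sticky" two-state chain. A smaller pseudo-spectral gap means stronger
# time dependence, which deflates the effective sample size in the risk bounds.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(pseudo_spectral_gap(P))                 # ~0.51 for this chain
```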

Key Contributions

The paper's primary contributions are twofold:

  1. Relaxation of the i.i.d. Assumption: By allowing observations to be drawn from a Markov chain, the authors significantly broaden the applicability of DNNs beyond traditional settings. The existence of a pseudo-spectral gap in the Markov chain is pivotal for addressing the dependencies in the data.
  2. Generalization Across Learning Problems: The paper extends its results to a more general class of machine learning tasks, encompassing both regression and classification problems such as least-square and logistic regression. Utilizing PAC-Bayes oracle inequalities and a version of Bernstein's inequality for Markov chains (Paulin, 2015), the authors derive upper bounds on the estimation risk of a generalized Bayesian estimator; a schematic form of such an estimator is sketched after this list.
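
For concreteness, a generalized Bayesian estimator of the kind described in the abstract can be written schematically as a Gibbs posterior. The display below is a generic template under this reading, not the paper's exact construction; the prior π over sparse network parameters, the inverse temperature λ, and the empirical risk r_n stand for choices that are made precisely in the paper.

```latex
% Generic Gibbs (generalized Bayesian) posterior over DNN parameters \theta:
\widehat{\rho}_{\lambda}(\mathrm{d}\theta)
  \;\propto\; \exp\!\bigl(-\lambda\, r_n(\theta)\bigr)\, \pi(\mathrm{d}\theta),
\qquad
r_n(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f_\theta(X_i),\, Y_i\bigr),
```

where f_θ is a ReLU network, ℓ is the squared loss for regression or the logistic loss for classification, and the pairs (X_i, Y_i) are the (dependent) observations from the Markov chain; predictions are then typically made by averaging over, or sampling from, the posterior ρ̂_λ.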

Main Results and Theoretical Implications

The paper establishes that, for least-square regression, the derived upper risk bounds match the known minimax rates up to a logarithmic factor. Analogous bounds are derived for the logistic classification loss, together with a matching lower bound, establishing that the proposed DNN estimator is optimal in the minimax sense. These results show that DNN estimators attain minimax rates in both settings even when the data are not i.i.d.
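
For orientation, the minimax rate being matched is the one derived by Schmidt-Hieber (2020) for compositions of Hölder-smooth functions; writing it out makes the "up to a logarithmic factor" statement concrete. The notation below follows that paper (smoothness indices β_i, effective dimensions t_i); the exact power of log n absorbed by the upper bound depends on the setting.

```latex
% Composition class: f = g_q \circ \cdots \circ g_0, where each g_i is \beta_i-Hölder
% and depends on at most t_i coordinates.  Effective smoothness and rate:
\beta_i^{*} \;=\; \beta_i \prod_{\ell = i+1}^{q} \bigl(\beta_\ell \wedge 1\bigr),
\qquad
\phi_n \;=\; \max_{0 \le i \le q} n^{-\frac{2\beta_i^{*}}{2\beta_i^{*} + t_i}},
% and the upper bounds in the present paper match \phi_n up to a power of \log n.
```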

PAC-Bayes Approach and Oracle Inequality

A noteworthy aspect is the use of PAC-Bayes bounds, which yield oracle inequalities and convergence rates for the estimators. The argument accommodates dependency structures such as the Markov property and, more broadly, mixing-type conditions, giving the theoretical framework considerable flexibility and robustness. This methodology underlies the derivation of the optimal rates and serves as a bridge between statistical learning theory and deep neural networks.
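
The flavour of bound at work is, schematically, the classical PAC-Bayes oracle inequality shown below; in the paper, the concentration step behind it is supplied by Paulin's (2015) Bernstein inequality for Markov chains rather than by an i.i.d. argument, so the admissible range of the temperature λ shrinks with the pseudo-spectral gap. The display is a generic template, not the paper's exact statement.

```latex
% Generic PAC-Bayes oracle inequality: with probability at least 1 - \varepsilon,
% for a Gibbs posterior \widehat{\rho}_{\lambda} built from a prior \pi,
\mathbb{E}_{\theta \sim \widehat{\rho}_{\lambda}} R(\theta)
  \;\le\;
  \inf_{\rho}
  \Bigl\{
    \mathbb{E}_{\theta \sim \rho} R(\theta)
    \;+\; C\,\frac{\mathrm{KL}(\rho \,\|\, \pi) + \log(1/\varepsilon)}{\lambda}
  \Bigr\}
  \;+\; \text{remainder}(\lambda, n),
% where R is the out-of-sample risk, the infimum runs over distributions \rho
% absolutely continuous w.r.t. \pi, and the remainder collects variance terms
% that, in the dependent case, involve the pseudo-spectral gap of the chain.
```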

Future Developments

The implications of this research resonate across both practical and theoretical domains. Practically, understanding DNN behavior on dependent data expands their usability in real-world settings where data dependencies are inevitable. Theoretically, the exploration of PAC-Bayes bounds in dependent settings enriches fundamental learning theory and prompts further investigation into non-i.i.d. settings.

Future research could explore further relaxing assumptions on dependency structures or extend the analysis to more complex neural architectures and tasks. This work paves the way for a deeper comprehension of learning dynamics in dependent data scenarios, potentially catalyzing innovations in fields where temporal or spatial data dependencies are prominent.

In conclusion, the paper provides a rigorous exploration of the performance and optimality of DNNs under dependent data settings, marking a significant stride in statistical learning theory applied to deep learning models.
