Local Risk Bounds for Statistical Aggregation (2306.17151v1)

Published 29 Jun 2023 in math.ST, cs.IT, cs.LG, math.IT, stat.ML, and stat.TH

Abstract: In the problem of aggregation, the aim is to combine a given class of base predictors to achieve predictions nearly as accurate as the best one. In this flexible framework, no assumption is made on the structure of the class or the nature of the target. Aggregation has been studied in both sequential and statistical contexts. Despite some important differences between the two problems, the classical results in both cases feature the same global complexity measure. In this paper, we revisit and tighten classical results in the theory of aggregation in the statistical setting by replacing the global complexity with a smaller, local one. Some of our proofs build on the PAC-Bayes localization technique introduced by Catoni. Among other results, we prove localized versions of the classical bound for the exponential weights estimator due to Leung and Barron and deviation-optimal bounds for the Q-aggregation estimator. These bounds improve over the results of Dai, Rigollet and Zhang for fixed design regression and the results of Lecué and Rigollet for random design regression.
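
As a concrete illustration of the framework described above, here is a minimal numerical sketch of exponential-weights aggregation over a finite dictionary of base predictors in fixed-design regression. It is a sketch under stated assumptions rather than code from the paper: the function name, the uniform default prior, and the temperature beta = 4 * sigma2 (the classical choice analyzed by Leung and Barron for Gaussian noise) are illustrative, and the paper's contribution concerns the localized risk bounds satisfied by such estimators, not their implementation.

```python
import numpy as np

def exponential_weights_aggregate(F, y, sigma2, prior=None, beta=None):
    """Aggregate a finite dictionary of base predictors by exponential weighting.

    F      : (M, n) array; row j holds the values of base predictor f_j at the
             n fixed design points.
    y      : (n,) array of noisy observations of the regression function.
    sigma2 : assumed noise variance.
    prior  : optional (M,) prior weights on the dictionary (uniform if None).
    beta   : temperature; beta = 4 * sigma2 is the classical Leung-Barron
             choice for Gaussian noise (an assumption here, not the only one).
    Returns the aggregated prediction sum_j theta_j * f_j and the weights theta.
    """
    M, n = F.shape
    if prior is None:
        prior = np.full(M, 1.0 / M)
    if beta is None:
        beta = 4.0 * sigma2
    # Empirical squared loss (1/n) * ||y - f_j||^2 of each base predictor.
    losses = np.mean((y[None, :] - F) ** 2, axis=1)
    # Exponential weights: theta_j proportional to prior_j * exp(-||y - f_j||^2 / beta),
    # computed with a max-shift for numerical stability.
    logits = np.log(prior) - n * losses / beta
    logits -= logits.max()
    theta = np.exp(logits)
    theta /= theta.sum()
    return theta @ F, theta
```

With the uniform prior, the classical oracle inequality for this estimator carries a global complexity term of order log M; the localized bounds developed in the paper replace this global quantity with a smaller, local one.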

References (75)
  1. P. Alquier. PAC-Bayesian bounds for randomized empirical risk minimizers. Mathematical Methods of Statistics, 17:279–304, 2008.
  2. P. Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
  3. M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
  4. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
  5. J.-Y. Audibert. Aggregated estimators and empirical complexity for least square regression. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 40(6):685–736, 2004.
  6. J.-Y. Audibert. PAC-Bayesian statistical learning theory. PhD thesis, Université Paris VI, 2004.
  7. J.-Y. Audibert. Progressive mixture rules are deviation suboptimal. In Advances in Neural Information Processing Systems 20, pages 41–48, 2008.
  8. J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009.
  9. J.-Y. Audibert and O. Catoni. Linear regression through PAC-Bayesian truncation. Preprint arXiv:1010.0072, 2010.
  10. J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.
  11. K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
  12. A. R. Barron. Are Bayes rules consistent in information? In Open Problems in Communication and Computation, pages 85–91. Springer, 1987.
  13. P. C. Bellec. Optimal bounds for aggregation of affine estimators. The Annals of Statistics, 46(1):30–59, 2018.
  14. G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, 2018.
  15. E. Carlen. Trace inequalities and quantum entropy: an introductory course. Entropy and the quantum, 529:73–140, 2010.
  16. O. Catoni. Statistical Learning Theory and Stochastic Optimization: Ecole d’Eté de Probabilités de Saint-Flour XXXI - 2001, volume 1851 of Lecture Notes in Mathematics. Springer-Verlag, 2004.
  17. O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of IMS Lecture Notes Monograph Series. Institute of Mathematical Statistics, 2007.
  18. N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
  19. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, New York, USA, 2006.
  20. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing. Wiley-Interscience, New York, USA, 2nd edition, 2006.
  21. D. Dai, P. Rigollet, and T. Zhang. Deviation optimal learning using greedy Q-aggregation. The Annals of Statistics, 40(3):1878–1905, 2012.
  22. A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1):39–61, 2008.
  23. A. S. Dalalyan. Simple proof of the risk bound for denoising by exponential weights for asymmetric noise distributions. arXiv preprint arXiv:2212.12950, 2022.
  24. A. S. Dalalyan and J. Salmon. Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics, 40(4):2327–2355, 2012.
  25. G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of Uncertainty in Artificial Intelligence, 2017.
  26. T. van Erven, W. M. Koolen, and D. van der Hoeven. MetaGrad: Adaptation using multiple learning rates in online learning. Journal of Machine Learning Research, 22(161):1–61, 2021.
  27. J. Forster and M. K. Warmuth. Relative expected instantaneous loss bounds. Journal of Computer and System Sciences, 64(1):76–102, 2002.
  28. D. P. Foster. Prediction in the worst case. The Annals of Statistics, 19:1084–1090, 1991.
  29. E. I. George. Minimax multiple shrinkage estimation. The Annals of Statistics, 14(1):188–205, 1986.
  30. P. Grünwald, T. Steinke, and L. Zakynthinou. PAC-Bayes, MAC-Bayes and conditional mutual information: Fast rate bounds that handle general VC classes. In Conference on Learning Theory, pages 2217–2247, 2021.
  31. L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
  32. M. Haghifam, G. K. Dziugaite, S. Moran, and D. M. Roy. Towards a unified information-theoretic framework for generalization. Advances in Neural Information Processing Systems, 34:26370–26381, 2021.
  33. J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  34. S. Hanneke. Refined error bounds for several learning algorithms. Journal of Machine Learning Research, 17(1):4667–4721, 2016.
  35. S. Hanneke and L. Yang. Minimax analysis of active learning. Journal of Machine Learning Research, 16(1):3487–3602, 2015.
  36. D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.
  37. A. Juditsky, P. Rigollet, and A. B. Tsybakov. Learning by mirror averaging. The Annals of Statistics, 36(5):2183–2206, 2008.
  38. V. Kanade, P. Rebeschini, and T. Vaškevičius. Exponential tail local Rademacher complexity risk bounds without the Bernstein condition. arXiv preprint arXiv:2202.11461, 2022.
  39. V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, volume 2033 of École d’Été de Probabilités de Saint-Flour. Springer-Verlag Berlin Heidelberg, 2011.
  40. W. M. Koolen and T. Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175, 2015.
  41. G. Lecué and S. Mendelson. Aggregation via empirical risk minimization. Probability Theory and Related Fields, 145(3):591, 2009.
  42. G. Lecué and S. Mendelson. On the optimality of the aggregate with exponential weights for low temperatures. Bernoulli, 19(2):646–675, 2013.
  43. G. Lecué and P. Rigollet. Optimal learning with Q-aggregation. The Annals of Statistics, 42(1):211–224, 2014.
  44. G. Leung and A. R. Barron. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory, 52(8):3396–3410, 2006.
  45. T. Liang, A. Rakhlin, and K. Sridharan. Learning with square loss: localization through offset Rademacher complexity. In Proceedings of The 28th Conference on Learning Theory, pages 1260–1285, 2015.
  46. N. Littlestone. From on-line to batch learning. In Proceedings of the 2nd annual workshop on Computational Learning Theory, pages 269–284. Morgan Kaufmann Publishers Inc., 1989.
  47. G. Lugosi and G. Neu. Online-to-PAC conversions: Generalization bounds via regret analysis. arXiv preprint arXiv:2305.19674, 2023.
  48. P. Massart. Concentration inequalities and model selection. Ecole d’Eté de Probabilités de Saint-Flour XXXIII, volume 1896 of Lecture Notes in Mathematics. Springer, 2007.
  49. D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
  50. D. A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
  51. N. Mehta. Fast rates with high probability in exp-concave statistical learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1085–1093, 2017.
  52. S. Mendelson. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168:641–674, 2017.
  53. S. Mendelson. An unrestricted learning procedure. Journal of the ACM, 66(6):1–42, 2019.
  54. J. Mourtada and S. Gaïffas. An improper estimator with optimal excess risk in misspecified density estimation and logistic regression. Journal of Machine Learning Research, 23(31):1–49, 2022.
  55. J. Mourtada, T. Vaškevičius, and N. Zhivotovskiy. Distribution-free robust linear regression. Mathematical Statistics and Learning, 4(3):253–292, 2021.
  56. A. Nemirovski. Topics in non-parametric statistics. Lectures on Probability Theory and Statistics: Ecole d’Ete de Probabilites de Saint-Flour XXVIII-1998, 28:85–277, 2000.
  57. Information-theoretic stability and generalization. Information-Theoretic Methods in Data Science, 10:302–329, 2021.
  58. A. Rakhlin, K. Sridharan, and A. B. Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.
  59. P. Rigollet. Kullback-Leibler aggregation and misspecified generalized linear models. The Annals of Statistics, 40(2):639–665, 2012.
  60. O. Shamir. The sample complexity of learning linear predictors with the squared loss. Journal of Machine Learning Research, 16(108):3475–3486, 2015.
  61. T. Steinke and L. Zakynthinou. Reasoning about generalization via conditional mutual information. In Conference on Learning Theory, pages 3437–3452. PMLR, 2020.
  62. I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer-Verlag New York, 2008.
  63. A. B. Tsybakov. Optimal rates of aggregation. In Proceedings of the 16th Conference on Computational Learning Theory, pages 303–313, 2003.
  64. A regret-variance trade-off in online learning. Advances in Neural Information Processing Systems, 35, 2022.
  65. V. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.
  66. V. N. Vapnik and A. Ya. Chervonenkis. Theory of Pattern Recognition: Statistical Learning Problems (in Russian). Nauka, Moscow, 1974.
  67. S. Vijaykumar. Localization, convexity, and star aggregation. Advances in Neural Information Processing Systems, 34:4570–4581, 2021.
  68. V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.
  69. V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.
  70. M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
  71. O. Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106(1):119–141, 2017.
  72. Y. Yang. Mixing strategies for density estimation. The Annals of Statistics, 28(1):75–87, 2000.
  73. T. Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006.
  74. T. Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
  75. N. Zhivotovskiy and S. Hanneke. Localization of VC classes: Beyond local Rademacher complexities. Theoretical Computer Science, 742:27–49, 2018.