A unified recipe for deriving (time-uniform) PAC-Bayes bounds (2302.03421v5)
Abstract: We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
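As a small illustration of step (c), the sketch below numerically checks the Donsker-Varadhan formula on a finite hypothesis set: log E_prior[exp(f)] equals the supremum over posteriors rho of E_rho[f] - KL(rho || prior), and the Gibbs posterior attains that supremum. The three-point prior, the values of `f`, and all function names are hypothetical choices made for illustration; none of them come from the paper.

```python
import math

def kl(rho, prior):
    """KL(rho || prior) for discrete distributions on the same finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0)

def log_mgf(f, prior):
    """Left-hand side of Donsker-Varadhan: log E_prior[exp(f)]."""
    return math.log(sum(p * math.exp(v) for p, v in zip(prior, f)))

def gibbs(f, prior):
    """Gibbs posterior, proportional to prior * exp(f)."""
    weights = [p * math.exp(v) for p, v in zip(prior, f)]
    total = sum(weights)
    return [w / total for w in weights]

def dv_objective(rho, f, prior):
    """Variational objective: E_rho[f] - KL(rho || prior)."""
    return sum(r * v for r, v in zip(rho, f)) - kl(rho, prior)

# Hypothetical example: three hypotheses with arbitrary values of f.
prior = [0.5, 0.3, 0.2]
f = [0.1, -0.4, 1.2]

lhs = log_mgf(f, prior)
rho_star = gibbs(f, prior)

# The Gibbs posterior attains the supremum; other posteriors fall below it.
print(lhs, dv_objective(rho_star, f, prior))
print(dv_objective(prior, f, prior), dv_objective([1.0, 0.0, 0.0], f, prior))
```

In the recipe described in the abstract, `f` would play the role of the logarithm of a nonnegative supermartingale evaluated at each hypothesis, so that this identity converts a bound on the prior mixture into a bound holding simultaneously for all posteriors.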
- A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
- Pierre Alquier. PAC-Bayesian bounds for randomized empirical risk minimizers. Mathematical Methods of Statistics, 17(4):279–304, 2008.
- Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv:2110.11216, 2021.
- Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018.
- On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016.
- Tighter PAC-Bayes bounds. Advances in Neural Information Processing Systems, 19, 2006.
- Integral probability metrics PAC-Bayes bounds. Advances in Neural Information Processing Systems, 36, 2022.
- Jean-Yves Audibert. Aggregated estimators and empirical complexity for least square regression. In Annales de l’IHP Probabilités et statistiques, volume 40, pages 685–736, 2004.
- PAC-Bayes learning bounds for sample-dependent priors. Advances in Neural Information Processing Systems, 33:4403–4414, 2020.
- Akshay Balsubramani. PAC-Bayes iterated logarithm bounds for martingale mixtures. arXiv:1506.06573, 2015.
- PAC-Bayesian bounds based on the Rényi divergence. In Artificial Intelligence and Statistics, pages 435–444. PMLR, 2016.
- Exponential inequalities for self-normalized martingales with applications. The Annals of Applied Probability, 18(5):1848–1869, 2008.
- Robert H Berk. Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, 37(1):51–58, 1966.
- Non-vacuous generalisation bounds for shallow neural networks. In International Conference on Machine Learning, pages 1963–1981. PMLR, 2022.
- Tighter PAC-Bayes generalisation bounds by leveraging example difficulty. In International Conference on Artificial Intelligence and Statistics, pages 8165–8182. PMLR, 2023.
- Occam’s hammer. In International Conference on Computational Learning Theory, pages 112–126. Springer, 2007.
- Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
- Olivier Catoni. A PAC-Bayesian approach to adaptive classification. Technical report, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2003.
- Olivier Catoni. Statistical learning theory and stochastic optimization: Ecole d’Eté de Probabilités de Saint-Flour, XXXI-2001, volume 1851. Springer Science & Business Media, 2004.
- Olivier Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Monograph, Institute of Mathematical Statistics lecture notes, 2007.
- Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et Statistiques, volume 48, pages 1148–1185, 2012.
- Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv:1712.02747, 2017.
- Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector. (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian Trends and Insights. NeurIPS Workshop, 2018.
- A generalized Catoni’s M-estimator under finite α-th moment assumption with α ∈ (1, 2). Electronic Journal of Statistics, 15(2):5523–5544, 2021.
- Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, pages 146–158, 1975.
- Confidence sequences for mean, variance, and median. Proceedings of the National Academy of Sciences, 58(1):66–68, 1967a.
- Iterated logarithm inequalities. Proceedings of the National Academy of Sciences, 57(5):1188–1192, 1967b.
- Bernard Delyon. Exponential inequalities for sums of weakly dependent variables. Electronic Journal of Probability, 14:752–779, 2009.
- MD Donsker and SRS Varadhan. Large deviations for Markov processes and the asymptotic evaluation of certain Markov process expectations for large times. In Probabilistic Methods in Differential Equations, pages 82–88. Springer, 1975.
- Rick Durrett. Probability: theory and examples. Cambridge University Press, 2019.
- Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Uncertainty in Artificial Intelligence, 2017.
- Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20:1–22, 2015.
- PAC-Bayesian policy evaluation for reinforcement learning. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 195–202, 2011.
- PAC-Bayesian lifelong learning for multi-armed bandits. Data Mining and Knowledge Discovery, 36(2):841–876, 2022.
- PAC-Bayes bounds for bandit problems: A survey and experimental comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- How tight can PAC-Bayes be in the small data regime? Advances in Neural Information Processing Systems, 34:4093–4105, 2021.
- PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360, 2009.
- Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16(26):787–860, 2015.
- A new PAC-Bayesian perspective on domain adaptation. In International Conference on Machine Learning, pages 859–868. PMLR, 2016.
- Safe testing. Journal of the Royal Statistical Society, Series B. To appear with discussion, 2023.
- Benjamin Guedj. A primer on PAC-Bayesian learning. Proceedings of the second congress of the French Mathematical Society, 33, 2019.
- PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics, 7:264–291, 2013.
- PAC-Bayes generalisation bounds for heavy-tailed losses through supermartingales. Transactions on Machine Learning Research, 2023.
- PAC-Bayes unleashed: generalisation bounds with unbounded losses. Entropy, 23(10):1330, 2021.
- Time-uniform Chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257–317, 2020.
- Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2):1055–1080, 2021.
- A bandit approach to sequential experimental design with false discovery control. Advances in Neural Information Processing Systems, 31, 2018.
- Tighter PAC-Bayes bounds through coin-betting. Conference on Learning Theory, 2023.
- Achim Klenke. Probability theory: a comprehensive course. Springer Science & Business Media, 2013.
- Solomon Kullback. Information theory and statistics. Wiley, New York, 1959.
- On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
- Efron-Stein PAC-Bayesian inequalities. arXiv preprint arXiv:1909.01931, 2019.
- Tze Leung Lai. On confidence sequences. The Annals of Statistics, pages 265–280, 1976.
- Bounds for averaging classifiers. School of Computer Science, Carnegie Mellon University, 2001.
- A J Lee. U-statistics: Theory and Practice. Routledge, 2019.
- Dichotomize and generalize: PAC-Bayesian binary activated deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- Distribution-dependent PAC-Bayes priors. In International Conference on Algorithmic Learning Theory, pages 119–133. Springer, 2010.
- A PAC-Bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations, 2020.
- A limitation of the PAC-Bayes framework. Advances in Neural Information Processing Systems, 33:20543–20553, 2020.
- Martingale methods for sequential estimation of convex functionals and divergences. IEEE Transactions on Information Theory, 2023.
- Andreas Maurer. A note on the PAC Bayesian theorem. arXiv:cs/0411099, 2004.
- David McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, 1998.
- David McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999.
- David McAllester. Simplified PAC-Bayesian margin bounds. In Learning Theory and Kernel Machines, pages 203–215. Springer, 2003.
- PAC-Bayes un-expected Bernstein inequality. Advances in Neural Information Processing Systems, 32, 2019.
- Novel change of measure inequalities with applications to PAC-Bayesian bounds and Monte Carlo estimation. In International Conference on Artificial Intelligence and Statistics, pages 1711–1719. PMLR, 2021.
- Tight concentrations and confidence sequences from the regret of universal portfolio. IEEE Transactions on Information Theory, 2023.
- PAC-Bayes bounds with data dependent priors. The Journal of Machine Learning Research, 13(1):3507–3531, 2012.
- Tighter risk certificates for neural networks. Journal of Machine Learning Research, 22, 2021.
- Admissible anytime-valid sequential inference must rely on nonnegative martingales. arXiv:2009.03167, 2020.
- Game-theoretic statistics and safe anytime-valid inference. Statistical Science (forthcoming), 2023.
- PAC-Bayes analysis beyond the usual bounds. Advances in Neural Information Processing Systems, 33:16833–16845, 2020.
- Herbert Robbins. Statistical methods related to the law of the iterated logarithm. The Annals of Mathematical Statistics, 41(5):1397–1409, 1970.
- PAC-Bayesian offline contextual bandits with guarantees. In International Conference on Machine Learning, pages 29777–29799. PMLR, 2023.
- Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.
- Matthias Seeger. Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. Technical report, University of Edinburgh, 2003.
- PAC-Bayesian generalization bound for density estimation with application to co-clustering. In Artificial Intelligence and Statistics, pages 472–479. PMLR, 2009.
- PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11(12), 2010.
- PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.
- Game-theoretic foundations for probability and finance, volume 455. John Wiley & Sons, 2019.
- A PAC analysis of a Bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 2–9, 1997.
- On integral probability metrics, ϕ-divergences and binary classification. arXiv:0901.2698, 2009.
- On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
- A strongly quasiconvex PAC-Bayesian bound. In International Conference on Algorithmic Learning Theory, pages 466–492. PMLR, 2017.
- PAC-Bayes-empirical-Bernstein inequality. Advances in Neural Information Processing Systems, 26, 2013.
- Jean Ville. Étude critique de la notion de collectif. Bull. Amer. Math. Soc, 45(11):824, 1939.
- Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
- A Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
- Catoni-style confidence sequences for heavy-tailed mean estimation. Stochastic Processes and their Applications, 2023a.
- Huber-robust confidence sequences. International Conference on Artificial Intelligence and Statistics, 2023b.
- Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society: Series B (Methodological), to appear with discussion, 2023.
- Time-uniform central limit theory and asymptotic confidence sequences. arXiv:2103.06476, 2021.
- Anytime-valid off-policy inference for contextual bandits. ACM/IMS Journal of Data Science (forthcoming), 2023.
- A unified framework for bandit multiple testing. Advances in Neural Information Processing Systems, 34:16833–16845, 2021.