Tuning-Free Coreset Markov Chain Monte Carlo via Hot DoG (2410.18973v2)
Abstract: A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. We provide a theoretical analysis of the convergence of the coreset weights produced by Hot DoG. We also provide empirical results demonstrating that Hot DoG provides higher-quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively with optimally-tuned ADAM.
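To make the step-size rule concrete, the snippet below is a minimal sketch of the distance-over-gradient (DoG) update (Ivgi et al., 2023) that Hot DoG builds on, run on a toy noisy-gradient problem. It is not the paper's Hot DoG algorithm (which adds a hot-start phase and operates on coreset weights with gradients estimated from adaptive MCMC draws); `dog_sgd`, `grad_estimate`, and the noisy quadratic objective are hypothetical stand-ins introduced for illustration.

```python
import numpy as np

def dog_sgd(grad_estimate, w0, num_iters=1000, r_eps=1e-6):
    """Sketch of a DoG-style parameter-free SGD loop:
    step size = (max distance travelled from w0) / sqrt(sum of squared gradient norms).
    `grad_estimate(w)` returns a stochastic gradient at w; in Coreset MCMC it would
    stand in for the coreset-weight gradient estimated from adaptive MCMC draws."""
    w = w0.copy()
    # Small initial "movement" so the first step size is nonzero.
    max_dist = r_eps * (1.0 + np.linalg.norm(w0))
    grad_sq_sum = 0.0
    for _ in range(num_iters):
        g = grad_estimate(w)
        grad_sq_sum += float(np.dot(g, g))
        eta = max_dist / np.sqrt(grad_sq_sum + 1e-12)  # distance over gradient
        w = w - eta * g
        max_dist = max(max_dist, np.linalg.norm(w - w0))
    return w

# Toy usage: noisy gradients of a quadratic, standing in for coreset-weight gradients.
rng = np.random.default_rng(0)
target = np.full(10, 3.0)
noisy_grad = lambda w: (w - target) + 0.1 * rng.standard_normal(w.shape)
w_hat = dog_sgd(noisy_grad, w0=np.zeros(10))
print(np.round(w_hat[:3], 2))  # approaches 3.0 without any tuned learning rate
```

The key property illustrated here is that no learning rate is supplied: the step size is built adaptively from the iterates' distance to the initialization and the accumulated gradient norms, which is the tuning-free behavior the abstract describes.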
- Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.
- A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science, 26(1):102–115, 2011.
- Bayesian Data Analysis. CRC Press, 3rd edition, 2013.
- Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.
- Fast Bayesian coresets via subsampling and quasi-Newton refinement. In Advances in Neural Information Processing Systems, 2022.
- Bayesian inference via sparse Hamiltonian flows. In Advances in Neural Information Processing Systems, 2022.
- Trevor Campbell. General bounds on the quality of Bayesian coresets. In Advances in Neural Information Processing Systems, 2024.
- Coreset Markov chain Monte Carlo. In International Conference on Artificial Intelligence and Statistics, 2024.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2014.
- Dog is SGD’s best friend: a parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023.
- DoWG unleashed: an efficient universal parameter-free gradient descent method. In Advances in Neural Information Processing Systems, 2023.
- Neural networks for machine learning lecture 6a: overview of mini-batch gradient descent, 2012.
- Sparse variational inference: Bayesian coresets from scratch. In Advances in Neural Information Processing Systems, 2019.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
- Making SGD parameter-free. In Conference on Learning Theory, 2022.
- Learning-rate-free learning by D-adaptation. In International Conference on Machine Learning, 2023.
- Prodigy: an expeditiously adaptive parameter-free learner. In International Conference on Machine Learning, 2024.
- Tutorial on parameter-free stochastic optimization. International Conference on Machine Learning, 2020.
- Painless stochastic gradient: interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, 2019.
- A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.
- Stochastic polyak step-size for SGD: an adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, 2021.
- Rank-normalization, folding, and localization: an improved $\widehat{R}$ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2):667–718, 2021.
- Hit-and-run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255–266, 1993.
- Radford Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.
- Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
- Stan: a probabilistic programming language. Journal of Statistical Software, 76(1):1–32, 2017.
- Arpad Elo. The Rating of Chessplayers, Past and Present. Arco, 1st edition, 1978.