Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior: From Theory to Practice (2211.07206v3)
Abstract: Meta-learning aims to speed up learning on new tasks by acquiring useful inductive biases from datasets of related learning tasks. While, in practice, the number of related tasks available is often small, most existing approaches assume an abundance of tasks, which makes them unrealistic and prone to overfitting. A central question in the meta-learning literature is how to regularize so as to ensure generalization to unseen tasks. In this work, we provide a theoretical analysis using PAC-Bayesian theory and present a generalization bound for meta-learning, first derived by Rothfuss et al. (2021a). Crucially, the bound allows us to derive the closed form of the optimal hyper-posterior, referred to as PACOH, which yields the best performance guarantees. We provide a theoretical analysis and an empirical case study of the conditions under which, and the extent to which, these meta-learning guarantees improve upon PAC-Bayesian per-task learning bounds. The closed-form PACOH inspires a practical meta-learning approach that avoids reliance on bi-level optimization, giving rise to a stochastic optimization problem that is amenable to standard variational methods and scales well. Our experiments show that, when instantiating PACOH with Gaussian process and Bayesian neural network models, the resulting methods are more scalable and yield state-of-the-art performance, both in predictive accuracy and in the quality of uncertainty estimates.
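For intuition, the closed form follows from the standard Gibbs-posterior argument: a PAC-Bayesian upper bound that is linear in the expected meta-level loss plus a KL term is minimized by an exponentially reweighted hyper-prior. A minimal sketch in the notation of Rothfuss et al. (2021a), with hyper-prior $\mathcal{P}$, temperature parameters $\lambda, \beta > 0$, and $n$ task datasets $S_1, \dots, S_n$ (the exact temperature constants depend on the particular bound):

```latex
% Gibbs-form minimizer: argmin_Q  E_{P~Q}[F(P)] + (1/gamma) KL(Q || P_hyper)
% is attained at Q*(P) \propto P_hyper(P) exp(-gamma F(P)).
% Applied to the meta-learning bound, this yields (constants schematic):
\mathcal{Q}^*(P) \;\propto\; \mathcal{P}(P)\,
  \exp\!\Big(\tfrac{1}{\lambda}\sum_{i=1}^{n}\ln Z_\beta(S_i, P)\Big),
\qquad
Z_\beta(S_i, P) \;=\; \mathbb{E}_{h\sim P}\big[e^{-\beta\,\hat{\mathcal{L}}(h, S_i)}\big],
```

where $Z_\beta(S_i, P)$ is the generalized marginal likelihood of task dataset $S_i$ under prior $P$. Since $\mathcal{Q}^*$ is known only up to its normalizer, approximating it requires nothing more than the unnormalized score, which is what makes single-level variational machinery applicable without bi-level optimization. The following toy sketch (not the paper's implementation; the one-dimensional Gaussian prior family, the quadratic task loss, and all temperature values are illustrative assumptions) evaluates the unnormalized log-density of $\mathcal{Q}^*$ and runs Stein variational gradient descent (Liu and Wang, 2016) on a set of prior particles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy meta-dataset: three 1-D regression tasks, each a small sample of its true weight.
tasks = [rng.normal(loc=c, scale=0.1, size=8) for c in (-1.0, 0.5, 1.2)]

beta, lam = 10.0, 5.0            # base- and meta-level temperatures (assumed values)
MC_EPS = rng.normal(size=256)    # fixed draws: common random numbers make log_Z smooth in mu

def log_Z(mu, S):
    """Monte Carlo estimate of ln Z_beta(S, P) for the prior P = N(mu, 1)."""
    h = mu + MC_EPS                                           # h ~ N(mu, 1)
    emp_risk = np.mean((h[:, None] - S[None, :]) ** 2, axis=1)
    return np.log(np.mean(np.exp(-beta * emp_risk)))

def log_q_star(mu):
    """Unnormalized log-density of the PAC-optimal hyper-posterior Q*."""
    return -0.5 * mu**2 + sum(log_Z(mu, S) for S in tasks) / lam  # hyper-prior N(0, 1)

def grad_log_q_star(mu, eps=1e-4):
    """Central-difference score; exact gradients would be used in practice."""
    return (log_q_star(mu + eps) - log_q_star(mu - eps)) / (2 * eps)

# SVGD: transport particles toward Q* using only the unnormalized score.
particles = rng.normal(0.0, 1.0, size=10)
bw, lr = 0.5, 0.05
for _ in range(300):
    grads = np.array([grad_log_q_star(x) for x in particles])
    diff = particles[:, None] - particles[None, :]            # diff[i, j] = x_i - x_j
    K = np.exp(-diff**2 / (2 * bw**2))                        # RBF kernel matrix
    # phi(x_i) = mean_j [ k(x_j, x_i) * grad_j + d/dx_j k(x_j, x_i) ]
    phi = (K @ grads + (diff * K).sum(axis=1) / bw**2) / len(particles)
    particles += lr * phi

print("particles over the prior mean:", np.round(np.sort(particles), 2))
```

The particles settle on prior means compatible with all three tasks at once, illustrating how the single-level stochastic objective replaces the nested optimization of gradient-based meta-learners such as MAML.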
- Pierre Alquier. PAC-Bayesian bounds for randomized empirical risk minimizers. Mathematical Methods of Statistics, 17:279–304, 2008.
- Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
- Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107:887–902, 2018.
- On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17:1–41, 2016.
- Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, 2018.
- Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
- Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
- Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 2000.
- Learning a synaptic learning rule. In International Joint Conference on Neural Networks, 1991.
- Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, 2013.
- Variational inference: A review for statisticians. Journal of the American Statistical Association, 112:859–877, 2017.
- Multi-task Gaussian Process prediction. In Advances in Neural Information Processing Systems, 2008.
- Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, 2013.
- Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. IMS Lecture Notes Monograph Series, 56, 2007.
- Multi-task and meta-learning with sparse linear bandits. In Uncertainty in Artificial Intelligence, 2021.
- Meta representation learning with contextual linear bandits. arXiv preprint arXiv:2205.15100, 2022.
- Generalization bounds for meta-learning: An information-theoretic analysis. Advances in Neural Information Processing Systems, 2021.
- Learning to Learn without Gradient Descent by Gradient Descent. In International Conference on Machine Learning, 2017.
- Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3:146–158, 1975.
- Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning. In Advances in Neural Information Processing Systems, 2021.
- Large deviations for Markov processes and the asymptotic evaluation of certain Markov process expectations for large times. In Probabilistic Methods in Differential Equations, 1975.
- Data-dependent PAC-Bayes priors via differential privacy. Advances in Neural Information Processing Systems, 2018.
- On the role of data in PAC-Bayes bounds. In International Conference on Artificial Intelligence and Statistics, 2021.
- Generalization bounds for meta-learning via PAC-Bayes and uniform stability. In Advances in Neural Information Processing Systems, 2021.
- Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
- Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.
- Deep Mean Functions for Meta-Learning in Gaussian Processes. arXiv preprint arXiv:1901.08098, 2019.
- Bayesian neural network priors revisited. In International Conference on Learning Representations, 2022.
- Neural processes. In ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
- PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, 2016.
- Unraveling meta-learning: Understanding feature representations for few-shot tasks. In International Conference on Machine Learning, 2020.
- Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.
- Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations, 2018.
- Fast-rate PAC-Bayesian generalization bounds for meta-learning. In International Conference on Machine Learning, 2022.
- Benjamin Guedj. A primer on PAC-Bayesian learning. In 2nd Congress of the French Mathematical Society, 2019.
- On calibration of modern neural networks. In International Conference on Machine Learning, 2017.
- Learning To Learn Using Gradient Descent. In International Conference on Artificial Neural Networks, 2001.
- Matthew Holland. PAC-Bayes under potentially heavy tails. Advances in Neural Information Processing Systems, 2019.
- Information-theoretic generalization bounds for meta-learning and applications. Entropy, 23:126, 2021.
- Information-Theoretic Analysis of Epistemic Uncertainty in Bayesian Meta-learning. In International Conference on Artificial Intelligence and Statistics, 2022.
- Herman Kahn. Use of Different Monte Carlo Sampling Techniques. RAND Corporation, 1955.
- Meta-learning hypothesis spaces for sequential decision-making. In International Conference on Machine Learning, 2022.
- Attentive neural processes. In International Conference on Learning Representations, 2019.
- Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
- Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces. In International Conference on Machine Learning, 2019a.
- Bayesian Optimization for Fast and Safe Parameter Tuning of SwissFEL. In International Free-Electron Laser Conference, 2019b.
- Contextual Gaussian Process bandit optimization. In Advances in Neural Information Processing Systems, 2011.
- Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, 2018.
- Human-level concept learning through probabilistic program induction. Science, 350:1332–1338, 2015.
- Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473:4–28, 2013.
- Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
- Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems, 2016.
- Statistical generalization performance guarantee for meta-learning with data-dependent prior. Neurocomputing, 465:391–405, 2021.
- SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models. In International Conference on Learning Representations, 2020.
- Samuel Madden. Intel lab data. http://db.csail.mit.edu/labdata/labdata.html, 2004. Accessed: Sep 8, 2020.
- Andreas Maurer. A note on the PAC Bayesian theorem. arXiv preprint cs/0411099, 2004.
- Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.
- David A McAllester. Some PAC-Bayesian theorems. Machine Learning, 1999.
- Kernels for Multi-task Learning. Advances in Neural Information Processing Systems, 2004.
- SwissFEL: the Swiss X-ray Free Electron Laser. Applied Sciences, 2017.
- A Simple Neural Attentive Meta-Learner. In International Conference on Learning Representations, 2018.
- On First-Order Meta-Learning Algorithms. arXiv preprint arXiv:1803.02999, 2018.
- PAC-Bayesian analysis of distribution dependent priors: Tighter risk bounds and stability analysis. Pattern Recognition Letters, 80:200–207, 2016.
- Learning the Kernel with Hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.
- Large margin multi-task metric learning. Advances in Neural Information Processing Systems, 2010.
- PAC-Bayes Bounds with Data Dependent Priors. Journal of Machine Learning Research, 13:3507–3531, 2012.
- A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, 2014.
- Tighter risk certificates for neural networks. Journal of Machine Learning Research, 22:10326–10365, 2021.
- Physically Based Rendering: From Theory to Implementation, chapter 13.7. Morgan Kaufmann, 3rd edition, 2016.
- Rethink and redesign meta learning. arXiv preprint arXiv:1812.04955, 2018.
- Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
- Amortized Bayesian meta-learning. In International Conference on Learning Representations, 2018.
- Optimization as a Model for Few-Shot Learning. In International Conference on Learning Representations, 2017.
- Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds. In Advances in Neural Information Processing Systems, 2018.
- Conditional mutual information-based generalization bound for meta learning. In International Symposium on Information Theory, 2021.
- Variational inference with normalizing flows. International Conference on Machine Learning, 2015.
- PAC-Bayes bounds for stable algorithms with instance-dependent priors. Advances in Neural Information Processing Systems, 2018.
- ProMP: Proximal Meta-Policy Search. In International Conference on Learning Representations, 2019.
- PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. In International Conference on Machine Learning, 2021a.
- Meta-learning Reliable Priors in the Function Space. In Advances in Neural Information Processing Systems, 2021b.
- Meta-Learning Priors for Safe Bayesian Optimization. In Conference on Robot Learning, 2022.
- One-shot learning with a hierarchical nonparametric Bayesian model. In ICML Workshop on Unsupervised and Transfer Learning, 2012.
- Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, 2016.
- Jürgen Schmidhuber. Evolutionary principles in self-referential learning. PhD thesis, Technische Universität München, 1987.
- Lifelong bandit optimization: no prior and no regret. In Uncertainty in Artificial Intelligence, 2023.
- Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.
- Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, 2018.
- Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
- Improved PAC-Bayesian bounds for linear regression. In AAAI Conference on Artificial Intelligence, 2020.
- Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In Computing in Cardiology, 2012.
- Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
- Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2009.
- Unifying variational inference and PAC-Bayes for supervised learning that scales. arXiv preprint arXiv:1910.10367, 2019.
- William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
- Learning to Learn. Springer, 1998.
- Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
- Inferring latent task structure for multitask learning by multiple kernel learning. Bioinformatics, 2010.
- Deep kernel learning. In International Conference on Artificial Intelligence and Statistics, 2016.
- Metafun: Meta-learning with iterative functional updates. In International Conference on Machine Learning, 2020.
- On approximating the modified Bessel function of the second kind. Journal of Inequalities and Applications, 41:1–8, 2017.
- Meta-learning without memorization. In International Conference on Learning Representations, 2020.
- Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.
- Learning Gaussian processes from multiple tasks. In International Conference on Machine Learning, 2005.
- Multiclass Multiple Kernel Learning. In International Conference on Machine Learning, 2007.