An Information-Theoretic Framework for Out-of-Distribution Generalization with Applications to Stochastic Gradient Langevin Dynamics (2403.19895v2)
Abstract: We study Out-of-Distribution (OOD) generalization in machine learning and propose a general framework for establishing information-theoretic generalization bounds. Our framework interpolates freely between the Integral Probability Metric (IPM) and the $f$-divergence, naturally recovering several known results (including Wasserstein- and KL-based bounds) while also yielding new generalization bounds. Additionally, we show that our framework admits an optimal transport interpretation. When evaluated on two concrete examples, the proposed bounds either strictly improve upon existing bounds or match the best existing OOD generalization bounds. Moreover, by focusing on the $f$-divergence and combining it with Conditional Mutual Information (CMI) methods, we derive a family of CMI-based generalization bounds, which include the state-of-the-art ICIMI bound as a special instance. Finally, leveraging these findings, we analyze the generalization of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, showing that the derived bounds outperform existing information-theoretic generalization bounds in certain scenarios.
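For orientation, here is a minimal sketch of the object behind this IPM/$f$-divergence interpolation, written in the infimal-convolution form of the $(f,\Gamma)$-divergence studied by Birrell et al.; the notation ($P$, $Q$, $\Gamma$, $\eta$) is assumed for this sketch and need not match the paper's:

$$
D^{f}_{\Gamma}(P \,\|\, Q) \;=\; \inf_{\eta}\Big\{ W^{\Gamma}(P,\eta) + D_{f}(\eta \,\|\, Q) \Big\},
\qquad
W^{\Gamma}(P,\eta) \;=\; \sup_{g \in \Gamma}\big|\mathbb{E}_{P}[g] - \mathbb{E}_{\eta}[g]\big|,
$$

where $W^{\Gamma}$ is the IPM generated by the function class $\Gamma$, $D_{f}$ is the $f$-divergence, and the infimum runs over intermediate probability measures $\eta$. If $\Gamma$ is rich enough that $W^{\Gamma}(P,\eta) = \infty$ whenever $\eta \neq P$, the expression collapses to $D_{f}(P\|Q)$; if instead $D_{f}(\eta\|Q)$ is $0$ at $\eta = Q$ and $+\infty$ elsewhere, it collapses to the pure IPM $W^{\Gamma}(P,Q)$. At the KL endpoint, $f(t) = t\log t$, this is the machinery behind classical mutual-information bounds such as $|\mathbb{E}[\operatorname{gen}(W,S)]| \le \sqrt{2\sigma^{2}\, I(W;S)/n}$ for $\sigma$-subgaussian losses (Xu and Raginsky), and the optimization over intermediate measures $\eta$ is one way to see the optimal transport interpretation mentioned above.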
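Since the concluding application is to SGLD, the following is a minimal sketch of the algorithm being analyzed, in a standard textbook parameterization (the function and variable names here are illustrative, and the paper's step-size and inverse-temperature conventions may differ):

```python
import numpy as np

def sgld_update(w, stochastic_grad, step_size, inv_temperature, rng):
    """One SGLD step: a stochastic gradient step plus isotropic Gaussian
    noise with variance 2 * step_size / inv_temperature per coordinate."""
    noise = rng.standard_normal(w.shape)
    return w - step_size * stochastic_grad \
             + np.sqrt(2.0 * step_size / inv_temperature) * noise

# Illustrative usage on the toy loss L(w) = 0.5 * ||w||^2, whose
# (full-batch) gradient is simply w.
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
for _ in range(1000):
    w = sgld_update(w, stochastic_grad=w, step_size=1e-2,
                    inv_temperature=10.0, rng=rng)
```

The injected Gaussian noise is what makes SGLD amenable to information-theoretic analysis: conditioned on the data and the previous iterate, each update is Gaussian, so mutual-information quantities such as $I(W;S)$ can be bounded step by step along the trajectory.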