Information Theoretic Lower Bounds for Information Theoretic Upper Bounds (2302.04925v2)
Abstract: We examine, in the setting of stochastic convex optimization, the relationship between an algorithm's generalization and the mutual information between its output model and the empirical sample. Despite growing interest in information-theoretic generalization bounds, it remains unclear whether such bounds can explain the strong performance of common learning algorithms. Our analysis of stochastic convex optimization shows that dimension-dependent mutual information is necessary for true risk minimization. This implies that existing information-theoretic generalization bounds cannot capture the generalization behavior of algorithms such as SGD and regularized ERM, whose sample complexity is dimension-independent.
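For context, the kind of bound the abstract refers to is sketched below: the canonical mutual-information generalization bound of Xu and Raginsky (2017), stated under the usual assumption that the loss $\ell(w, z)$ is $\sigma$-subgaussian. The symbols $W$ (output model), $S$ (an i.i.d. sample of size $n$), $\mathcal{L}_{\mathcal{D}}$ (population risk), and $\mathcal{L}_{S}$ (empirical risk) are notation chosen here for illustration, not taken from the paper.

```latex
% Canonical mutual-information generalization bound (Xu & Raginsky, 2017),
% assuming the loss \ell(w, z) is \sigma-subgaussian for every w:
%   W        -- the model returned by the algorithm
%   S        -- the i.i.d. training sample of size n
%   I(W; S)  -- mutual information between the output and the sample
\[
  \bigl|\,\mathbb{E}\bigl[\mathcal{L}_{\mathcal{D}}(W) - \mathcal{L}_{S}(W)\bigr]\,\bigr|
  \;\le\;
  \sqrt{\frac{2\sigma^{2}\, I(W; S)}{n}}.
\]
```

Read against the abstract, the lower bound says that in $d$-dimensional stochastic convex optimization any algorithm that minimizes the true risk must have $I(W; S)$ growing with $d$, so bounds of this form cannot certify the dimension-independent sample complexity achieved by SGD and regularized ERM.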