Information-Theoretic Generalization Bounds for Deep Neural Networks (2404.03176v1)
Abstract: Deep neural networks (DNNs) exhibit an exceptional capacity for generalization in practical applications. This work aims to capture the effect and benefits of depth for supervised learning via information-theoretic generalization bounds. We first derive two hierarchical bounds on the generalization error in terms of the Kullback-Leibler (KL) divergence or the 1-Wasserstein distance between the train and test distributions of the network internal representations. The KL divergence bound shrinks as the layer index increases, while the Wasserstein bound implies the existence of a layer that serves as a generalization funnel, which attains a minimal 1-Wasserstein distance. Analytic expressions for both bounds are derived under the setting of binary Gaussian classification with linear DNNs. To quantify the contraction of the relevant information measures when moving deeper into the network, we analyze the strong data processing inequality (SDPI) coefficient between consecutive layers of three regularized DNN models: Dropout, DropConnect, and Gaussian noise injection. This enables refining our generalization bounds to capture the contraction as a function of the network architecture parameters. Specializing our results to DNNs with a finite parameter space and the Gibbs algorithm reveals that deeper yet narrower network architectures generalize better in those examples, although how broadly this conclusion extends remains an open question.
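For context, the LaTeX sketch below records two standard ingredients behind bounds of this flavor; it is an illustration under stated assumptions, not the paper's exact statements. The first display is the classical mutual-information generalization bound of Xu and Raginsky for a σ-sub-Gaussian loss; the second is the strong data processing inequality that makes a KL divergence contract when both the train and test representation distributions are pushed through the same stochastic layer map.

```latex
% Sketch only; assumptions: sigma-sub-Gaussian loss, n training samples,
% hypothesis W, and a fixed stochastic layer map (channel) K_k producing
% layer k+1 from layer k under both the train and test distributions.
\[
  \bigl|\mathbb{E}[\mathrm{gen}(W,S)]\bigr|
  \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W;S)} .
\]
% SDPI contraction across one layer, with coefficient
% \eta_k = \eta_{\mathrm{KL}}(K_k) \le 1:
\[
  D\!\left(P^{\mathrm{train}}_{T_{k+1}} \,\big\|\, P^{\mathrm{test}}_{T_{k+1}}\right)
  \;\le\; \eta_k \,
  D\!\left(P^{\mathrm{train}}_{T_{k}} \,\big\|\, P^{\mathrm{test}}_{T_{k}}\right).
\]
```

The three regularized DNN models named in the abstract inject stochasticity into a layer in different ways. The minimal NumPy sketch below illustrates one common formulation of each (inverted Dropout on activations, DropConnect on weights, additive Gaussian noise on the pre-activation); the layer placement and the 1/(1-p) scaling are assumptions made for illustration and may differ from the paper's exact models.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, p=0.5):
    # Dropout: zero each input activation independently with prob. p,
    # rescaling survivors by 1/(1-p) so the output is unbiased in expectation.
    mask = rng.random(x.shape) >= p
    return W @ (x * mask / (1.0 - p))

def dropconnect_layer(x, W, p=0.5):
    # DropConnect: zero each weight entry independently with prob. p.
    mask = rng.random(W.shape) >= p
    return (W * mask / (1.0 - p)) @ x

def gaussian_noise_layer(x, W, sigma=0.1):
    # Gaussian noise injection: add isotropic noise to the linear output.
    z = W @ x
    return z + sigma * rng.standard_normal(z.shape)

# Toy forward pass through a single stochastic layer of each type.
x = rng.standard_normal(8)
W = rng.standard_normal((4, 8))
for layer in (dropout_layer, dropconnect_layer, gaussian_noise_layer):
    print(layer.__name__, layer(x, W))
```

In each case the layer acts as a noisy channel from one representation to the next, which is what makes an SDPI coefficient between consecutive layers well defined and, for nondegenerate noise levels, typically strictly smaller than one.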