How Does Information Bottleneck Help Deep Learning?
(2305.18887v1)
Published 30 May 2023 in cs.LG, cs.AI, cs.CL, cs.CV, cs.IT, and math.IT
Abstract: Numerous deep learning algorithms have been inspired by and understood via the notion of information bottleneck, where unnecessary information is (often implicitly) minimized while task-relevant information is maximized. However, a rigorous argument for justifying why it is desirable to control information bottlenecks has been elusive. In this paper, we provide the first rigorous learning theory for justifying the benefit of information bottleneck in deep learning by mathematically relating information bottleneck to generalization errors. Our theory proves that controlling information bottleneck is one way to control generalization errors in deep learning, although it is not the only or necessary way. We investigate the merit of our new mathematical findings with experiments across a range of architectures and learning settings. In many cases, generalization errors are shown to correlate with the degree of information bottleneck: i.e., the amount of the unnecessary information at hidden layers. This paper provides a theoretical foundation for current and future methods through the lens of information bottleneck. Our new generalization bounds scale with the degree of information bottleneck, unlike the previous bounds that scale with the number of parameters, VC dimension, Rademacher complexity, stability or robustness. Our code is publicly available at: https://github.com/xu-ji/information-bottleneck
The paper provides the first rigorous generalization bound for learned encoders by linking conditional mutual information and parameter information to generalization error.
It refines previous results by replacing an exponential dependence on total mutual information with a linear dependence on I(X;Z|Y) for fixed encoders.
Empirical validations on toy and CIFAR10 datasets demonstrate that minimizing the combined metric of representation and model compression best predicts generalization performance.
The paper "How Does Information Bottleneck Help Deep Learning?" (Kawaguchi et al., 2023) addresses a fundamental open problem in deep learning theory: providing a rigorous statistical learning theory justification for why the information bottleneck (IB) principle is beneficial for generalization, particularly in the common scenario where intermediate representations (and thus the encoder mapping input to representation) are learned from the training data.
The information bottleneck principle suggests finding a representation $Z$ of input $X$ that is maximally relevant to the target $Y$ while minimizing the information $Z$ retains about $X$. This is often formalized as minimizing the mutual information $I(X;Z)$ while maximizing $I(Y;Z)$. While this principle has inspired various deep learning algorithms and been used to understand their behavior, a rigorous mathematical proof connecting the control of information bottlenecks to generalization error, especially when the representation function is learned, has been lacking. Previous work proposed a conjecture relating the generalization error to $2^{I(X;Z_s^l)}$ [shwartz2019representation], but this conjecture was shown to be invalid when the encoder $\phi_s^l$ is learned from the training data $s$ [hafez2020sample], since minimizing $I(X;Z_s^l)$ alone does not prevent the encoder's parameters from overfitting to the training data.
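For concreteness, the classical IB objective is usually written as a Lagrangian over stochastic encoders $p(z \mid x)$; the block below states this standard textbook formulation rather than a formula specific to this paper.

```latex
% Classical information bottleneck objective: compress X while
% preserving information about Y, with beta > 0 controlling the trade-off.
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta\, I(Y;Z)
```

Algorithms that "control the information bottleneck" can be viewed as minimizing (an estimate or surrogate of) the first term while keeping the second term high.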
This paper makes two main theoretical contributions to fill this gap:
Improved Bound for Fixed/Independent Encoders (Theorem 1): The paper first provides a rigorous proof of, and a significant improvement upon, the previous conjecture in a simplified setting where the encoder $\phi_s^l$ is fixed independently of the training data $s$. The new generalization bound depends on $I(X;Z_s^l \mid Y)/n$ rather than $2^{I(X;Z_s^l)}/n$: the dependence on mutual information changes from exponential to linear, and the bound uses the conditional mutual information $I(X;Z_s^l \mid Y)$ (the superfluous information about $X$ given $Y$), which is better aligned with the IB goal than the total information $I(X;Z_s^l)$. This result is relevant in practical scenarios such as transfer learning, where a pre-trained encoder (learned on independent data) is reused. In the bound, $\Delta(s)$ denotes the generalization error, $n$ the sample size, and $\mathcal{G}_1^l$, $\mathcal{G}_2^l$, $\mathcal{G}_3^l$ are terms that become constants as $n \to \infty$, involving factors such as the maximum loss $\mathcal{R}(f^s)$ and the sensitivity to nuisance variables $c_y^l(\phi_s^l)$.
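Schematically, and suppressing the $\mathcal{G}$ terms and constants that the paper tracks explicitly, the improvement over the earlier conjecture can be summarized as below; this is a paraphrase of the stated dependence, not the exact statement of Theorem 1.

```latex
% Earlier conjecture: exponential in the total information I(X; Z_s^l)
\Delta(s) \;\lesssim\; \sqrt{\frac{2^{I(X;Z_s^l)}}{n}}
\qquad\text{vs.}\qquad
% Theorem 1 (encoder fixed independently of s): linear in the
% superfluous information I(X; Z_s^l | Y)
\Delta(s) \;\lesssim\; \sqrt{\frac{I(X;Z_s^l \mid Y)}{n}}
```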
First Generalization Bound for Learned Encoders (Theorem 2 - Main Result): The paper's core contribution is the first rigorous generalization bound for the typical deep learning setting where the encoder $\phi_s^l$ is learned from the same training data $s$. This bound addresses the overfitting issue highlighted by the counterexample above. The main factor in this bound is the sum of two mutual information terms, $I(X;Z_s^l \mid Y)$ and $I(\phi_S^l; S)$.
$I(X;Z_s^l \mid Y)$: This term captures the "representation compression" aspect of the information bottleneck: how much superfluous information about the input $X$ is retained in the representation $Z_s^l$ after conditioning on the target $Y$ (see the decomposition after this list). Minimizing this term is a goal of IB.
$I(\phi_S^l; S)$: This term captures the "model compression" or overfitting aspect: how much information about the training dataset $S$ is encoded in the learned parameters of the encoder $\phi_S^l$. This is a standard measure of how strongly a model's parameters depend on the data, used in other information-theoretic bounds [xu2017information].
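The label "superfluous information" for the conditional term reflects a standard identity: when the representation is computed from the input alone, so that $Y \to X \to Z_s^l$ forms a Markov chain, the total information the representation carries about the input splits into a task-relevant part and a superfluous part. This is a general information-theoretic fact rather than a result specific to this paper.

```latex
% Under the Markov chain Y -> X -> Z_s^l (the representation is a
% function of X only), I(Y; Z_s^l | X) = 0 and the chain rule gives
I(X;Z_s^l) \;=\; \underbrace{I(Y;Z_s^l)}_{\text{task-relevant}}
\;+\; \underbrace{I(X;Z_s^l \mid Y)}_{\text{superfluous}}
```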
The theorem bounds the generalization error by the minimum over candidate layers $l$ of a quantity that depends linearly on the sum of these two mutual information terms, $I(X;Z_s^l \mid Y) + I(\phi_S^l; S)$. Here, $S$ denotes the random variable for the training dataset, emphasizing that $I(\phi_S^l; S)$ is an expectation over dataset draws, while $I(X;Z_s^l \mid Y)$ is conditioned on a specific dataset $s$. The remaining quantities $\mathcal{G}_1^l$, $\mathcal{G}_3^l$, $\widehat{\mathcal{G}}_2^l$, $\check{\mathcal{G}}_2^l$, and $\zeta$ involve constants, the maximum loss, sensitivities, and entropy terms, and become constant or bounded as $n \to \infty$. The bound reveals a fundamental trade-off: reducing the superfluous information in the representation (the first term) by training the encoder on data often requires the encoder to store more information about the training data itself (the second term). Good generalization requires balancing these two factors, and the bound suggests minimizing their sum over suitable layers $l$.
Practical Implementation and Applications:
Applying these theoretical bounds in practice requires estimating the mutual information terms $I(X;Z_s^l \mid Y)$ and $I(\phi_S^l; S)$.
Estimating $I(X;Z_s^l \mid Y)$: For deterministic neural networks this can be challenging, since mutual information can be infinite for continuous variables. The paper discusses several approaches to handle this:
Binning: Discretizing the outputs of hidden layers into a finite number of bins and computing discrete mutual information (see the sketch after this list).
Noise Injection (KDE): Treating the deterministic output as subject to additive noise (e.g., Gaussian) for the purpose of analysis, which allows kernel density estimation (KDE) of MI for the now-stochastic representation. The paper provides theoretical justification for such methods (Corollary 1, Proposition 4): bounding the original model's generalization error via the perturbed or binned model's MI bound requires adding a term related to the distance between the original and perturbed representations, indicating a trade-off in estimation granularity.
Stochastic Networks: Training networks that inherently produce stochastic representations (e.g., VAE encoders), where MI can be more directly estimated (e.g., using the reparameterization trick).
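As a concrete illustration of the binning approach, the sketch below estimates $I(X;Z_s^l \mid Y)$ from a sample of hidden activations and labels. It relies on the fact that for a deterministic (binned) encoder $H(Z \mid X, Y) = 0$, so $I(X;Z \mid Y) = H(Z \mid Y)$, estimated here as the label-weighted empirical entropy of the binned codes. The function name and bin width are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def binned_conditional_mi(z, y, bin_width=0.5):
    """Estimate I(X; Z | Y) for a deterministic encoder via binning.

    After binning, Z is a deterministic function of X, so H(Z | X, Y) = 0
    and I(X; Z | Y) = H(Z | Y), estimated as the average empirical entropy
    of the discretized codes within each class.

    z : (n, d) array of hidden-layer activations
    y : (n,)   array of integer class labels
    """
    # Discretize each activation dimension into fixed-width bins.
    codes = np.floor(z / bin_width).astype(np.int64)

    def entropy_of_rows(rows):
        # Empirical entropy (in nats) of the discrete codes.
        _, counts = np.unique(rows, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    total, n = 0.0, len(y)
    for c in np.unique(y):
        mask = (y == c)
        total += mask.sum() / n * entropy_of_rows(codes[mask])  # p(y) * H(Z | Y=y)
    return total

# Illustrative usage with random stand-ins for activations and labels.
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 8))
y = rng.integers(0, 10, size=1000)
print(binned_conditional_mi(z, y))
```

A KDE-based variant has the same structure, replacing the discrete entropy with a differential-entropy estimate under an assumed additive-noise level.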
Estimating $I(\phi_S^l; S)$: This requires reasoning about the distribution over the learned parameters $\phi_S^l$ given the training data $S$. The paper uses Stochastic Weight Averaging-Gaussian (SWAG) [maddox2019simple] as an approximate Bayesian inference method to model the posterior distribution over parameters, from which $I(\phi_S^l; S)$ can be estimated (a sketch follows below).
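The sketch below shows one generic way such an estimate could be assembled once a diagonal-Gaussian SWAG posterior has been fit for each of several training-set draws: it uses the identity $I(\phi; S) = \mathbb{E}_S[\mathrm{KL}(p(\phi \mid S) \,\|\, p(\phi))]$ and replaces the intractable marginal $p(\phi)$ with a moment-matched Gaussian. This is an illustrative approximation, not necessarily the paper's exact estimator.

```python
import numpy as np

def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) in nats."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def parameter_information(posteriors):
    """Approximate I(phi; S) = E_S[ KL( p(phi | S) || p(phi) ) ].

    posteriors: list of (mu, var) pairs, one diagonal-Gaussian SWAG
    posterior per training-set draw. The marginal p(phi) is replaced
    by a diagonal Gaussian moment-matched to the equal-weight mixture,
    a crude but closed-form stand-in.
    """
    mus = np.stack([mu for mu, _ in posteriors])
    vars_ = np.stack([var for _, var in posteriors])
    mu_bar = mus.mean(axis=0)
    # Mixture variance = mean of component variances + variance of means.
    var_bar = vars_.mean(axis=0) + mus.var(axis=0)
    return np.mean([gaussian_kl_diag(mu, var, mu_bar, var_bar)
                    for mu, var in posteriors])

# Illustrative usage: 5 dataset draws, 100 encoder parameters each.
rng = np.random.default_rng(0)
posteriors = [(rng.normal(size=100), np.full(100, 0.05)) for _ in range(5)]
print(parameter_information(posteriors))
```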
Experimental Validation:
The paper conducts experiments on toy 2D classification data and CIFAR10 image classification to empirically evaluate the predictive power of the proposed combined metric $I(X;Z_s^l \mid Y) + I(\phi_S^l; S)$ against other metrics (parameter count, norm-based measures, $I(X;Z_s^l)$ alone, $I(S;\theta)$ alone) for generalization error.
Experiments on toy data involved training models with stochastic features, some explicitly constrained to have constant $I(X;Z_s^l)$ in order to test whether $I(X;Z_s^l)$ alone is sufficient.
Experiments on CIFAR10 used standard deterministic DNNs (PreResNets), estimating $I(X;Z_s^l \mid Y)$ with KDE under different variance selection schemes (adaptive, MLE); $I(\phi_S^l; S)$ was estimated using SWAG.
Experiments with binning on toy deterministic models also tested the metrics.
The consistent finding across experiments is that metrics combining representation compression ($I(X;Z_s^l \mid Y)$) and model compression ($I(\phi_S^l; S)$) are the strongest predictors of generalization error. In particular, minimizing the combined term over layers, $\min_l \{ I(X;Z_s^l \mid Y) + I(\phi_S^l; S) \}$, showed the highest correlation with the generalization gap. This empirical evidence strongly supports the paper's main theoretical finding, highlighting that both aspects of information compression are crucial for understanding generalization in deep learning.
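To make the combined predictor concrete, the minimal sketch below computes $\min_l \{ I(X;Z_s^l \mid Y) + I(\phi_S^l; S) \}$ per model from precomputed per-layer estimates and rank-correlates it with measured generalization gaps. The per-layer numbers are placeholders, and the use of Spearman correlation is an illustrative choice rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def combined_metric(mi_repr_per_layer, mi_params_per_layer):
    """min over layers l of I(X; Z^l | Y) + I(phi^l; S)."""
    return min(r + p for r, p in zip(mi_repr_per_layer, mi_params_per_layer))

# Hypothetical per-model, per-layer estimates and measured generalization gaps.
models = [
    {"mi_repr": [2.1, 1.4, 0.9], "mi_params": [0.3, 0.6, 1.2], "gap": 0.08},
    {"mi_repr": [3.0, 2.2, 1.5], "mi_params": [0.4, 0.9, 1.6], "gap": 0.15},
    {"mi_repr": [1.2, 0.8, 0.5], "mi_params": [0.2, 0.4, 0.7], "gap": 0.04},
]

metric = [combined_metric(m["mi_repr"], m["mi_params"]) for m in models]
gaps = [m["gap"] for m in models]
rho, _ = spearmanr(metric, gaps)  # rank correlation with the generalization gap
print(metric, rho)
```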
Implementation Considerations:
Implementing the theoretical framework involves:
Selecting a method to estimate mutual information for the chosen network architecture (stochastic vs. deterministic, discrete vs. continuous features). This involves trade-offs between theoretical exactness (difficult for deterministic continuous features) and practicality (using approximations like binning, KDE, or training stochastic models).
Choosing an approach to model the distribution over learned parameters (e.g., Bayesian inference methods like SWAG).
Estimating the required mutual information terms during or after training. This can add significant computational overhead compared to standard training, especially for complex networks and large datasets.
The theoretical bounds include other terms (such as $\mathcal{G}_1^l$, $\mathcal{G}_2^l$, $\mathcal{G}_3^l$) that depend on properties like Lipschitz constants and sensitivities. While the paper shows these terms become constant asymptotically or can be bounded, estimating them precisely in practice adds further complexity; the experiments therefore focus on the MI terms as the primary factors.
In summary, this paper provides foundational theoretical support for the information bottleneck principle in deep learning by deriving the first rigorous generalization bounds for learned representations. It shows that good generalization depends not just on compressing superfluous input information in the representation, but also on limiting the information about the training data encoded in the learned encoder parameters. The empirical results validate this insight, showing that a combined metric effectively predicts generalization performance. This work offers a new theoretical lens for understanding generalization and could inspire training algorithms or regularization techniques that explicitly or implicitly control both representation and model compression.