Adaptive Estimators Show Information Compression in Deep Neural Networks (1902.09037v2)

Published 24 Feb 2019 in cs.LG, cs.NE, and stat.ML

Abstract: To improve how neural networks function it is crucial to understand their learning process. The information bottleneck theory of deep learning proposes that neural networks achieve good generalization by compressing their representations to disregard information that is not relevant to the task. However, empirical evidence for this theory is conflicting, as compression was only observed when networks used saturating activation functions. In contrast, networks with non-saturating activation functions achieved comparable levels of task performance but did not show compression. In this paper we developed more robust mutual information estimation techniques that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded functions. Using these adaptive estimation techniques, we explored compression in networks with a range of different activation functions. With two improved methods of estimation, we first show that saturation of the activation function is not required for compression, and that the amount of compression varies between different activation functions. We also find that there is a large amount of variation in compression between different network initializations. Secondly, we see that L2 regularization leads to significantly increased compression, while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization.

Citations (33)

Summary

  • The paper introduces two adaptive techniques, EBAB and aKDE, to accurately measure mutual information in networks using non-saturating activations.
  • The paper demonstrates that variability from initialization and L2 regularization significantly influence compression dynamics across layers.
  • The paper finds that while hidden layer compression may not always predict generalization, last-layer compression is positively linked to performance.

Deep neural networks (DNNs) have achieved remarkable performance across various tasks, but a complete understanding of their learning process and generalization capabilities remains an active area of research. The information bottleneck (IB) theory suggests that DNNs generalize well by compressing their representations, discarding task-irrelevant information while retaining relevant features. However, empirical studies using standard mutual information (MI) estimation techniques showed conflicting results, particularly for networks employing non-saturating activation functions like ReLU, where compression was not consistently observed, unlike networks with saturating functions like tanh.

This paper addresses the challenge of accurately measuring mutual information in deep neural networks, particularly when using non-saturating activation functions, which lead to unbounded hidden layer activity. Standard MI estimation methods, such as fixed-range binning or kernel density estimation (KDE) with fixed noise levels, prove inadequate because the distribution and range of hidden activations vary significantly across layers and training epochs (as illustrated in Figure 1 of the paper). Using non-adaptive methods can lead to inconsistent and potentially misleading MI estimates, often underestimating compression.
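
As a toy illustration of this failure mode (not taken from the paper), consider two layers whose activations have the same distributional shape but very different ranges, as happens with unbounded activations across layers and training epochs. Fixed-range bins sized to the network-wide maximum collapse the small-ranged layer into essentially one bin, while equal-count bins chosen from each layer's own values do not. The 30-bin setting, the exponential toy distributions, and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def binned_entropy(values, edges):
    """Discrete entropy (in bits) of the values after assigning them to bins."""
    _, counts = np.unique(np.digitize(values, edges), return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Two hypothetical ReLU-like layers: same shape of distribution, very different ranges.
layer_small = rng.exponential(scale=1.0, size=10_000)
layer_large = rng.exponential(scale=50.0, size=10_000)

# Fixed-range binning: 30 bins spanning the global maximum over both layers.
fixed_edges = np.linspace(0.0, max(layer_small.max(), layer_large.max()), 31)

# Adaptive equal-count binning: edges taken from each layer's own unique values.
def equal_count_edges(values, n_bins=30):
    return np.quantile(np.unique(values), np.linspace(0.0, 1.0, n_bins + 1))

print(binned_entropy(layer_small, fixed_edges))                     # ~0 bits: almost every value shares one bin
print(binned_entropy(layer_large, fixed_edges))                     # a few bits
print(binned_entropy(layer_small, equal_count_edges(layer_small)))  # ~log2(30), about 4.9 bits
print(binned_entropy(layer_large, equal_count_edges(layer_large)))  # ~log2(30), about 4.9 bits
```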

To overcome this, the authors propose two adaptive estimation techniques:

  1. Entropy-Based Adaptive Binning (EBAB): Instead of using fixed-width bins or a fixed range based on the maximum activation across the entire network and training process, EBAB determines bin boundaries dynamically for each layer at each epoch. The boundaries are chosen such that each bin contains an equal number of unique observed activation levels for that specific layer and epoch. Repeated activation levels (common near saturation points) are ignored when defining boundaries but included when calculating entropy. This ensures that the binning adapts to the specific distribution of activations in a layer, even if the values are unbounded or heavily skewed (Figure 2). For a deterministic network where the hidden activity $T$ is a function of the input $X$, $I(T,X) = H(T) - H(T|X)$. When continuous $T$ is discretized via binning, $H(T|X)$ is effectively zero, and $I(T,X)$ is approximated by the discrete entropy $H(T)$ of the binned activation vectors.
  2. Adaptive KDE (aKDE): This method adds small Gaussian noise to the hidden activity to make the variable non-deterministic and thus allow for finite MI estimation with continuous variables. The critical adaptation is that the variance of the added Gaussian noise is scaled proportionally to the maximum absolute activation value observed in the specific layer and epoch being measured. If $\sigma_0^2$ is a base noise variance, the layer-specific variance is $\sigma^2 = \sigma_0^2 \times \max(|T|)^2$. This ensures that the noise level is consistent relative to the magnitude of the activations across different layers and epochs (Figure 3). The mutual information $I(\hat{T},X)$ for the noisy variable $\hat{T} = T + Z$ (where $Z$ is the noise) is estimated using KDE-based methods, often relying on the approximation $I(\hat{T},X) \approx H(\hat{T}) - H(Z)$.

Practical Implementation of Adaptive Estimators:

Implementing these methods involves several steps during the training process:

  • Capture Hidden Activations: At chosen intervals (e.g., every few epochs), perform a forward pass on the training dataset (or a representative subset) and store the activation values of each hidden layer for each input sample. For a network with $L$ layers and $N$ training samples, storing the activations of one layer with $U$ units for one epoch requires an $N \times U$ matrix; doing this across many epochs and layers can be memory intensive (a capture sketch is given after this list).
  • Offline Mutual Information Calculation: After training is complete and activations are saved, calculate the mutual information estimates offline.
  • Applying EBAB:
    • For each layer and epoch:
    • Flatten the activation matrix into a 1D array of all activation values for all units in that layer for all samples.
    • Find the set of unique activation values.
    • Sort the unique values.
    • Determine bin edges by dividing the sorted unique values into num_bins quantiles (e.g., for 30 bins, find the values that divide the sorted unique list into 30 equal parts).
    • Iterate through the original (non-unique) activation values and assign each value to a bin based on the determined edges.
    • Treat the binned values as discrete symbols. Calculate the entropy of the distribution of binned activation vectors across all samples in the layer for that epoch; this entropy is the $I(T,X)$ estimate for that layer/epoch.
  • Applying aKDE:
    • For each layer and epoch:
    • Flatten the activation matrix.
    • Calculate the maximum absolute value of activations in the layer/epoch.
    • Determine the noise variance $\sigma^2 = \sigma_0^2 \times (\max(|T|))^2$.
    • Generate Gaussian noise with this variance and the same shape as the activation matrix.
    • Add the noise to the activation matrix element-wise.
    • Use a KDE-based entropy or MI estimation algorithm on the noisy activations. The paper cites Kolchinsky & Tracey [Kolchinsky], which provides methods for estimating MI directly from pairwise distances in the noisy space, avoiding explicit density estimation or separate $H(\hat{T})$ and $H(Z)$ calculations (a code sketch of both estimators follows this list).
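
The two sketches below are illustrative only and are not the authors' code. The first shows one way the activation-capture step could be implemented; it assumes a PyTorch model instrumented with forward hooks, and the tiny architecture and random inputs are stand-ins rather than the paper's setup.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the paper's network and training data.
model = nn.Sequential(nn.Linear(12, 10), nn.ReLU(),
                      nn.Linear(10, 7), nn.ReLU(),
                      nn.Linear(7, 2))
inputs = torch.randn(512, 12)          # a representative subset of the training inputs

captured = {}                          # layer name -> (n_samples, n_units) activations

def make_hook(name):
    def hook(module, hook_inputs, output):
        captured[name] = output.detach().cpu().numpy()
    return hook

# Store the post-activation output of every hidden nonlinearity.
for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(inputs)                      # one forward pass at a chosen epoch

# captured["1"], captured["3"] now hold the hidden-layer activation matrices.
```

The second sketch covers the two estimators themselves for a single layer at a single epoch, assuming NumPy/SciPy and a 30-bin default for EBAB. For aKDE it uses the Kolchinsky & Tracey pairwise-distance upper bound, in which the added Gaussian noise enters through the kernel variance $\sigma^2$ rather than being sampled explicitly; that is a design choice of this sketch, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.spatial.distance import pdist, squareform

def ebab_mi(acts, n_bins=30):
    """EBAB estimate of I(T,X) for one layer/epoch; acts has shape (n_samples, n_units)."""
    uniq = np.unique(acts)                                  # unique activation levels, sorted
    # Equal-count bin edges over the *unique* values: repeated values (e.g. saturated
    # units) do not influence the edges, but they are still binned and counted below.
    edges = np.quantile(uniq, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    binned = np.digitize(acts, edges)                       # per-unit bin indices

    # For a deterministic network I(T,X) = H(T): the entropy (in bits) of the
    # distribution of binned activation vectors, one vector per input sample.
    _, counts = np.unique(binned, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def akde_mi(acts, base_noise_var=1e-3):
    """aKDE estimate of I(T_hat,X) for one layer/epoch via a pairwise-distance upper bound."""
    n = acts.shape[0]
    # Layer/epoch-adaptive noise variance: sigma^2 = sigma_0^2 * max|T|^2.
    sigma2 = base_noise_var * np.max(np.abs(acts)) ** 2

    # Treat T_hat = T + Z as a mixture of Gaussians N(t_i, sigma^2 I); the bound on
    # H(T_hat) minus H(Z) reduces to pairwise squared distances between activations.
    sq_dists = squareform(pdist(acts, metric="sqeuclidean"))
    log_mix = logsumexp(-sq_dists / (2.0 * sigma2), axis=1) - np.log(n)
    return float(-np.mean(log_mix) / np.log(2))             # nats -> bits
```

Applying ebab_mi and akde_mi to the stored activation matrix of every layer at every saved epoch yields the $I(T,X)$ trajectories plotted on the information plane.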

The computational cost of MI calculation can be significant, especially for aKDE and large datasets or layers. Binning is generally faster but requires careful selection of the number of bins.

Key Findings and Their Practical Implications:

Using these adaptive estimators on a small feedforward network trained on a synthetic binary classification task, the paper reveals several key insights:

  • Compression with Non-Saturating Activations is Possible: Contrary to previous reports, networks with non-saturating functions like ReLU, ELU [elu], Swish [swish1], and centered softplus exhibit compression when measured with adaptive methods (Figure 4, Figure 5). This is a crucial finding for practitioners, confirming that the information dynamics predicted by IB theory are not limited to older, saturating activation functions and can be studied in modern architectures.
  • High Variability Across Initializations: Even with the same architecture and activation function (ReLU), different random weight initializations can lead to vastly different trajectories on the information plane, showing compression, fitting, or neither (Figure 4). This highlights the stochasticity of training and suggests that averaging over multiple initializations is necessary to understand typical behavior (Figure 5), a common practice in research but less often emphasized in practical model development unless robust performance across runs is critical.
  • L2 Regularization Induces Compression: Applying L2 regularization to network weights strongly encourages compression in ReLU networks that might otherwise show little. Furthermore, an increased L2 penalty causes the mutual information values of different hidden layers to cluster together on the information plane (Figure 6). This suggests that L2 regularization does not merely keep weights small but actively shapes the information stored by each layer, potentially by encouraging redundancy reduction or more focused representations. This provides a new perspective on L2 and might inspire regularization techniques designed explicitly to control information flow.
  • Only Last Layer Compression Correlates with Generalization: Using a simple compression metric (1 minus the ratio of the final $I(T,X)$ to the maximum $I(T,X)$ reached during training), the authors found no significant correlation between compression in hidden layers and generalization accuracy. However, compression of the last softmax layer was positively correlated with generalization (Figure 7). This finding suggests that for this specific, simple task and network, the "information bottleneck" crucial for generalization may occur primarily at the final output stage rather than uniformly across all hidden layers. For practitioners, this implies that analyzing or optimizing information dynamics might be most impactful for the network's final representation.
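
A brief sketch of that compression metric and its use is given below. The trajectories and accuracies are made-up toy values, the Pearson correlation is used purely for illustration, and the paper's analysis was run over many trained networks rather than three.

```python
import numpy as np
from scipy.stats import pearsonr

def compression_score(mi_trajectory):
    """1 minus the ratio of the final I(T,X) to the maximum I(T,X) seen during training."""
    mi = np.asarray(mi_trajectory, dtype=float)
    return 1.0 - mi[-1] / mi.max()

# Toy example: three runs (initializations) of the same network. Each row is the
# estimated I(T,X) of one layer over training, paired with that run's test accuracy.
mi_runs = np.array([
    [2.0, 3.5, 3.2, 2.4],   # rises, then compresses
    [1.8, 3.0, 3.1, 3.1],   # fits but barely compresses
    [2.1, 3.4, 3.0, 2.0],   # compresses strongly
])
test_accuracy = np.array([0.91, 0.88, 0.93])

scores = [compression_score(run) for run in mi_runs]
r, p = pearsonr(scores, test_accuracy)
print("compression scores:", np.round(scores, 3))
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```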

Limitations and Generalizability:

It is important to note the limitations acknowledged by the authors. The paper used a small, fully-connected network on a synthetic dataset. Applying these exact methods and conclusions directly to large-scale, complex models (like convolutional or transformer networks) on real-world tasks requires caution. The computational cost of saving and processing activations for MI calculation would be significantly higher. The simple compression metric might not fully capture the nuanced dynamics of compression in complex, hierarchical representations.

Despite these limitations, the paper provides valuable practical contributions by demonstrating how to robustly measure information dynamics in deep networks with modern activation functions and revealing the profound impact of initialization and L2 regularization on these dynamics. The finding about the last layer's compression being most relevant to generalization on this task offers a potential area of focus for future research and potentially for designing training objectives that specifically target the information structure of the final layer.