- The paper introduces Shannon invariants that bypass traditional PID complexity by leveraging conventional Shannon entropies.
- The paper quantifies how information is distributed using two metrics, the average degree of redundancy (r̄) and the average degree of vulnerability (v̄), yielding interpretable summaries of neural network layers.
- The paper demonstrates that representations in deeper layers and larger bottlenecks are more redundant and more robust, offering practical guidance for deep learning analysis.
This paper introduces "Shannon invariants" as a scalable alternative to traditional Partial Information Decomposition (PID) for analyzing how information is distributed and processed in complex systems like neural networks. Traditional PID aims to break down the mutual information between multiple source variables X=(X1,…,Xn) and a target variable Y into distinct "atoms" representing unique, redundant, and synergistic contributions. However, PID faces two major hurdles: multiple competing definitions for these atoms exist, and calculating the full decomposition becomes computationally infeasible for more than a few variables due to super-exponential complexity.
The core idea is to focus on specific linear combinations of these information atoms that can be calculated solely using standard Shannon entropies (and thus mutual information). These combinations are termed "Shannon invariants" because their values don't depend on the specific PID definition chosen. This approach bypasses both the ambiguity and the computational bottleneck of full PID.
Two key Shannon invariants are introduced (a computational sketch follows their definitions):
- Average Degree of Redundancy (r̄): This measures, on average, how many individual source variables Xi provide access to a piece of information about Y. It's calculated as:
r̄ = (∑i I(Xi;Y)) / I(X;Y), with the sum running over i = 1, …, n.
A higher r̄ indicates that information is predominantly accessible through multiple individual sources (source-level redundancy). r̄ < 1 implies the presence of source-level synergy (information only available from groups of sources).
- Average Degree of Vulnerability (v̄): This measures, on average, how many individual source variables are strictly necessary to access a piece of information about Y: removing any one of these necessary sources makes the information inaccessible. It's calculated as:
v̄ = (∑j I(Xj;Y|X−j)) / I(X;Y), with the sum running over j = 1, …, n,
where X−j denotes all sources except Xj. A higher v̄ indicates that information often critically depends on specific individual sources, making it vulnerable. v̄ = 0 corresponds to fully robust information (no single source is critical), while v̄ > 1 implies a stronger form of synergy where information depends critically on multiple specific sources.
The paper highlights two distinct ways to conceptualize redundancy and synergy, each defined per information atom:
- Source-level: based on how many single sources give access to an atom (its degree of redundancy r). Redundancy means r > 1; synergy means r = 0 (access only through groups of sources).
- Robustness/Vulnerability: based on resilience to source removal (the degree of vulnerability v, the number of sources whose individual removal destroys access). Robustness means v = 0; vulnerability means v ≥ 1.
These concepts are related but not equivalent: source-level redundancy (r > 1) implies robustness (v = 0), and vulnerability (v > 1) implies source-level synergy (r = 0); unique information sits in between, with r = 1 and v = 1.
The framework provides new interpretations for existing metrics and inspires new ones:
- Redundancy-Synergy Index (RSI): A known metric, RSI(X;Y) = ∑i I(Xi;Y) − I(X;Y), is shown to be directly related to the average degree of redundancy:
RSI(X;Y) = (r̄ − 1) I(X;Y)
This confirms the intuition that RSI > 0 indicates dominance of source-level redundancy (r̄ > 1) and RSI < 0 indicates dominance of source-level synergy (r̄ < 1). It emphasizes higher-order redundancies more strongly.
- Dual Redundancy-Synergy Index (DRSI): A novel metric proposed in analogy to RSI, focusing on the robustness/vulnerability perspective:
DRSI(X;Y) = I(X;Y) − ∑j I(Xj;Y|X−j) = (1 − v̄) I(X;Y)
DRSI > 0 indicates dominance of robustness (v̄ < 1), while DRSI < 0 indicates dominance of vulnerability (v̄ > 1). It emphasizes higher-order vulnerabilities more strongly.
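A quick worked example makes the signs concrete. Take two fair bits. If the source is duplicated and copied to the target (X1 = X2 = Y), then I(Xi;Y) = 1 bit each while I(X;Y) = 1 bit, so r̄ = 2 and RSI = +1 bit; each I(Xj;Y|X−j) = 0, so v̄ = 0 and DRSI = +1 bit. If instead Y = X1 XOR X2 for independent bits, then I(Xi;Y) = 0, giving r̄ = 0 and RSI = −1 bit, while each I(Xj;Y|X−j) = 1 bit, giving v̄ = 2 and DRSI = −1 bit.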
Practical Implementation & Application:
The primary advantage of Shannon invariants is their scalability. Calculating r̄ requires computing n individual mutual informations I(Xi;Y) and one joint mutual information I(X;Y); calculating v̄ requires n conditional mutual informations I(Xj;Y|X−j) and the same joint term. While estimating high-dimensional (conditional) mutual information can still be challenging, the number of terms scales only linearly in the number of sources (O(n)), a significant improvement over the super-exponential scaling of full PID.
The paper demonstrates the utility of these invariants by analyzing deep learning models:
- Methodology: To apply information theory to deterministic neural networks, the authors use networks whose activation values are stochastically quantized to 8 levels; this makes the layer map noisy rather than injective, so the information measures are finite and well-defined (a sketch of one such quantizer appears after this list). The analysis is performed on the network's activations in response to the training set, treated as a complete population, thus calculating exact information values for that dataset rather than estimates for an underlying distribution.
- Feedforward MNIST Classifier: They analyzed a 5-hidden-layer network.
- r̄ increased across successive hidden layers (especially equally-sized ones) and over training epochs.
- v̄ decreased across layers and over training.
- Interpretation: Deeper layers and trained networks tend to encode information about the class label more redundantly (accessible from multiple neurons) and more robustly (less dependent on any single neuron).
- Convolutional Face Autoencoder: They analyzed an encoder-bottleneck-decoder architecture.
- r̄ increased during initial training. Decoder layers generally showed higher r̄ than size-matched encoder layers.
- In the bottleneck layer, r̄ increased with bottleneck size (nb), while v̄ decreased with it.
- Interpretation: Larger bottleneck capacity allows for more redundant and robust encoding of the input image information. Constrained bottlenecks lead to less redundant, more vulnerable encodings.
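The paper's exact quantization scheme isn't reproduced here; the following is a minimal sketch of one standard choice, stochastic rounding onto 8 evenly spaced levels, which turns the deterministic activation map into a noisy one (the function name and min-max scaling are our illustrative assumptions):

```python
import numpy as np

def stochastic_quantize(acts, levels=8, rng=None):
    """Stochastically round activations onto `levels` evenly spaced values.

    Activations are min-max scaled onto the grid 0..levels-1, then rounded
    down or up at random with probability equal to the fractional part, so
    the quantizer is unbiased in expectation but non-deterministic.
    """
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = acts.min(), acts.max()
    scaled = (acts - lo) / (hi - lo + 1e-12) * (levels - 1)
    low = np.floor(scaled)
    go_up = rng.random(acts.shape) < (scaled - low)    # Bernoulli(fractional part)
    return (low + go_up).astype(np.int64)
```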
Summary for Practitioners:
Shannon invariants (r̄ and v̄) offer a computationally feasible way to quantify average redundancy and synergy/vulnerability in high-dimensional systems like neural network layers, without needing to commit to a specific PID definition. They require calculating standard (conditional) mutual information terms, scaling linearly with the number of source components (e.g., neurons in a layer).
- Use r̄ = (∑i I(Xi;Y)) / I(X;Y) to measure average source-level redundancy.
- Use v̄ = (∑j I(Xj;Y|X−j)) / I(X;Y) to measure average vulnerability/robustness.
- When analyzing deterministic NNs, consider quantizing activations.
- Calculations can be done directly on the training dataset distribution.
- These measures can track how information representation changes across layers, during training, or with architectural changes (like bottleneck size), providing insights into network function and learning dynamics (e.g., whether representations become more distributed/robust).
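Putting the pieces together, here is a hedged sketch of how r̄ and v̄ could be computed for one quantized layer, treating the training set as the complete population (exact plug-in counts, no density estimator). All names are illustrative, not from the paper; note that for very wide layers the joint terms degenerate on finite data, since each joint activation pattern may occur only once:

```python
import numpy as np

def plugin_mi_bits(a, b):
    """Exact plug-in I(A;B) in bits between two discrete label arrays,
    treating the paired samples as the full population."""
    a = np.unique(a, return_inverse=True)[1]       # relabel to 0..K-1
    b = np.unique(b, return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)                  # joint counts
    p = joint / joint.sum()
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (pa * pb)[nz])))

def layer_invariants(z, y):
    """r̄ and v̄ for quantized activations z (samples x neurons) and labels y."""
    def code(cols):                                # one symbol per joint pattern
        return np.unique(cols, axis=0, return_inverse=True)[1]
    n = z.shape[1]
    i_xy = plugin_mi_bits(code(z), y)              # I(X;Y) for the whole layer
    sum_single = sum(plugin_mi_bits(z[:, i], y) for i in range(n))
    sum_cond = sum(i_xy - plugin_mi_bits(code(np.delete(z, j, axis=1)), y)
                   for j in range(n))              # chain rule, as before
    return sum_single / i_xy, sum_cond / i_xy
```

Tracking these two numbers per layer across checkpoints is enough to reproduce the kind of trends reported above (rising r̄, falling v̄ with depth and training).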