- The paper introduces Shannon invariants that bypass traditional PID complexity by leveraging conventional Shannon entropies.
- The paper quantifies how information is distributed using two metrics, the average degree of redundancy (r̄) and the average degree of vulnerability (v̄), yielding interpretable summaries of neural network layers.
- The paper demonstrates that representations in deeper layers and larger bottlenecks are more redundant and more robust, offering practical guidance for deep learning analysis.
This paper introduces "Shannon invariants" as a scalable alternative to traditional Partial Information Decomposition (PID) for analyzing how information is distributed and processed in complex systems like neural networks. Traditional PID aims to break down the mutual information between multiple source variables X=(X1,…,Xn) and a target variable Y into distinct "atoms" representing unique, redundant, and synergistic contributions. However, PID faces two major hurdles: multiple competing definitions for these atoms exist, and calculating the full decomposition becomes computationally infeasible for more than a few variables due to super-exponential complexity.
The core idea is to focus on specific linear combinations of these information atoms that can be calculated solely using standard Shannon entropies (and thus mutual information). These combinations are termed "Shannon invariants" because their values don't depend on the specific PID definition chosen. This approach bypasses both the ambiguity and the computational bottleneck of full PID.
Two key Shannon invariants are introduced (a computational sketch follows their definitions):
- Average Degree of Redundancy (r̄): This measures, on average, how many individual source variables Xi provide access to a piece of information about Y. It's calculated as:
r̄ = (∑i I(Xi;Y)) / I(X;Y), with the sum running over i = 1, …, n.
A higher r̄ indicates that information is predominantly accessible through multiple individual sources (source-level redundancy). r̄ < 1 implies the presence of source-level synergy (information only available from groups of sources).
- Average Degree of Vulnerability (v̄): This measures, on average, how many individual source variables are strictly necessary to access a piece of information about Y: removing any one of these necessary sources makes the information inaccessible. It's calculated as:
v̄ = (∑j I(Xj;Y|X−j)) / I(X;Y), with the sum running over j = 1, …, n,
where X−j denotes all sources except Xj. A higher v̄ indicates that information often critically depends on specific individual sources, making it vulnerable. v̄ = 0 corresponds to fully robust information (no single source is critical), while v̄ > 1 implies a stronger form of synergy where information depends critically on multiple specific sources.
The paper highlights two distinct ways to conceptualize redundancy and synergy, each defined per information atom:
- Source-level: based on how many single sources give access to an atom (its degree of redundancy r). Redundancy means r > 1; synergy means r = 0 (access only through groups of sources).
- Robustness/Vulnerability: based on resilience to source removal (the degree of vulnerability v, the number of sources whose individual removal destroys access). Robustness means v = 0; vulnerability means v ≥ 1.
These concepts are related but not equivalent: source-level redundancy (r > 1) implies robustness (v = 0), and vulnerability (v > 1) implies source-level synergy (r = 0); unique information sits in between, with r = 1 and v = 1.
The framework provides new interpretations for existing metrics and inspires new ones:
- Redundancy-Synergy Index (RSI): A known metric, RSI(X;Y) = ∑i I(Xi;Y) − I(X;Y), is shown to be directly related to the average degree of redundancy:
RSI(X;Y) = (r̄ − 1) I(X;Y)
This confirms the intuition that RSI > 0 indicates dominance of source-level redundancy (r̄ > 1) and RSI < 0 indicates dominance of source-level synergy (r̄ < 1). It emphasizes higher-order redundancies more strongly.
- Dual Redundancy-Synergy Index (DRSI): A novel metric proposed in analogy to RSI, focusing on the robustness/vulnerability perspective:
DRSI(X;Y) = I(X;Y) − ∑j I(Xj;Y|X−j) = (1 − v̄) I(X;Y)
DRSI > 0 indicates dominance of robustness (v̄ < 1), while DRSI < 0 indicates dominance of vulnerability (v̄ > 1). It emphasizes higher-order vulnerabilities more strongly.
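A quick worked example makes the signs concrete. Take two fair bits. If the source is duplicated and copied to the target (X1 = X2 = Y), then I(Xi;Y) = 1 bit each while I(X;Y) = 1 bit, so r̄ = 2 and RSI = +1 bit; each I(Xj;Y|X−j) = 0, so v̄ = 0 and DRSI = +1 bit. If instead Y = X1 XOR X2 for independent bits, then I(Xi;Y) = 0, giving r̄ = 0 and RSI = −1 bit, while each I(Xj;Y|X−j) = 1 bit, giving v̄ = 2 and DRSI = −1 bit.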
Practical Implementation & Application:
The primary advantage of Shannon invariants is their scalability. Calculating r̄ requires computing n individual mutual informations I(Xi;Y) and one joint mutual information I(X;Y); calculating v̄ requires n conditional mutual informations I(Xj;Y|X−j) and the same joint term. While estimating high-dimensional (conditional) mutual information can still be challenging, the number of terms scales only linearly in the number of sources (O(n)), a significant improvement over the super-exponential scaling of full PID.
The paper demonstrates the utility of these invariants by analyzing deep learning models:
- Methodology: To apply information theory to deterministic neural networks, the authors use networks whose activation values are stochastically quantized to 8 levels; this makes the layer map noisy rather than injective, so the information measures are finite and well-defined (a sketch of one such quantizer appears after this list). The analysis is performed on the network's activations in response to the training set, treated as a complete population, thus calculating exact information values for that dataset rather than estimates for an underlying distribution.
- Feedforward MNIST Classifier: They analyzed a 5-hidden-layer network.
- r̄ increased across successive hidden layers (especially equally-sized ones) and over training epochs.
- v̄ decreased across layers and over training.
- Interpretation: Deeper layers and trained networks tend to encode information about the class label more redundantly (accessible from multiple neurons) and more robustly (less dependent on any single neuron).
- Convolutional Face Autoencoder: They analyzed an encoder-bottleneck-decoder architecture.
- r̄ increased during initial training. Decoder layers generally showed higher r̄ than size-matched encoder layers.
- In the bottleneck layer, r̄ increased with bottleneck size (nb), while v̄ decreased with it.
- Interpretation: Larger bottleneck capacity allows for more redundant and robust encoding of the input image information. Constrained bottlenecks lead to less redundant, more vulnerable encodings.
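The paper's exact quantization scheme isn't reproduced here; the following is a minimal sketch of one standard choice, stochastic rounding onto 8 evenly spaced levels, which turns the deterministic activation map into a noisy one (the function name and min-max scaling are our illustrative assumptions):

```python
import numpy as np

def stochastic_quantize(acts, levels=8, rng=None):
    """Stochastically round activations onto `levels` evenly spaced values.

    Activations are min-max scaled onto the grid 0..levels-1, then rounded
    down or up at random with probability equal to the fractional part, so
    the quantizer is unbiased in expectation but non-deterministic.
    """
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = acts.min(), acts.max()
    scaled = (acts - lo) / (hi - lo + 1e-12) * (levels - 1)
    low = np.floor(scaled)
    go_up = rng.random(acts.shape) < (scaled - low)    # Bernoulli(fractional part)
    return (low + go_up).astype(np.int64)
```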
Summary for Practitioners:
Shannon invariants (r̄ and v̄) offer a computationally feasible way to quantify average redundancy and synergy/vulnerability in high-dimensional systems like neural network layers, without needing to commit to a specific PID definition. They require calculating standard (conditional) mutual information terms, scaling linearly with the number of source components (e.g., neurons in a layer).
- Use r̄ = (∑i I(Xi;Y)) / I(X;Y) to measure average source-level redundancy.
- Use v̄ = (∑j I(Xj;Y|X−j)) / I(X;Y) to measure average vulnerability/robustness.
- When analyzing deterministic NNs, consider quantizing activations.
- Calculations can be done directly on the training dataset distribution.
- These measures can track how information representation changes across layers, during training, or with architectural changes (like bottleneck size), providing insights into network function and learning dynamics (e.g., whether representations become more distributed/robust).
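Putting the pieces together, here is a hedged sketch of how r̄ and v̄ could be computed for one quantized layer, treating the training set as the complete population (exact plug-in counts, no density estimator). All names are illustrative, not from the paper; note that for very wide layers the joint terms degenerate on finite data, since each joint activation pattern may occur only once:

```python
import numpy as np

def plugin_mi_bits(a, b):
    """Exact plug-in I(A;B) in bits between two discrete label arrays,
    treating the paired samples as the full population."""
    a = np.unique(a, return_inverse=True)[1]       # relabel to 0..K-1
    b = np.unique(b, return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)                  # joint counts
    p = joint / joint.sum()
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (pa * pb)[nz])))

def layer_invariants(z, y):
    """r̄ and v̄ for quantized activations z (samples x neurons) and labels y."""
    def code(cols):                                # one symbol per joint pattern
        return np.unique(cols, axis=0, return_inverse=True)[1]
    n = z.shape[1]
    i_xy = plugin_mi_bits(code(z), y)              # I(X;Y) for the whole layer
    sum_single = sum(plugin_mi_bits(z[:, i], y) for i in range(n))
    sum_cond = sum(i_xy - plugin_mi_bits(code(np.delete(z, j, axis=1)), y)
                   for j in range(n))              # chain rule, as before
    return sum_single / i_xy, sum_cond / i_xy
```

Tracking these two numbers per layer across checkpoints is enough to reproduce the kind of trends reported above (rising r̄, falling v̄ with depth and training).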