
A Theory of Usable Information Under Computational Constraints (2002.10689v1)

Published 25 Feb 2020 in cs.LG and stat.ML

Abstract: We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting \emph{predictive $\mathcal{V}$-information} encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, $\mathcal{V}$-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, $\mathcal{V}$-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive $\mathcal{V}$-information is more effective than mutual information for structure learning and fair representation learning.

Citations (157)

Summary

  • The paper introduces predictive $\mathcal{F}$-information to quantify usable information by accounting for an observer's computational limits, relaxing Shannon theory's assumption of unbounded computation.
  • It derives PAC-style estimation bounds and demonstrates the framework's effectiveness in structure learning, gene regulatory network inference, and video frame ordering.
  • Empirical results show that the framework's asymmetry and its violation of the data processing inequality can be exploited in representation learning and fair representation learning.

This paper introduces predictive $\mathcal{F}$-information (the abstract's predictive $\mathcal{V}$-information, written here after the predictive family $\mathcal{F}$), a novel framework for quantifying information that explicitly incorporates the computational limitations and modeling capabilities of an observer. It argues that traditional Shannon information theory, while foundational, is insufficient for many AI and machine learning scenarios because it assumes computationally unbounded observers. For instance, encrypted data has high Shannon mutual information with the original plaintext, but low usable information for an observer without the decryption key or sufficient computational power.

The core idea is to define information relative to a predictive family $\mathcal{F}$, a set of predictive models $f: \mathcal{X} \cup \{\emptyset\} \rightarrow \mathcal{P}(\mathcal{Y})$ that an observer is allowed to use to predict a target variable $Y$ given side information $X$ (or no side information, $\emptyset$).

  1. $\mathcal{F}$-Entropy: The conditional $\mathcal{F}$-entropy $H_{\mathcal{F}}(Y \mid X)$ is defined as the minimum expected negative log-likelihood achievable using models from $\mathcal{F}$ to predict $Y$ given $X$: $H_{\mathcal{F}}(Y \mid X) = \inf_{f \in \mathcal{F}} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right]$. The marginal $\mathcal{F}$-entropy $H_{\mathcal{F}}(Y)$ is defined similarly but without side information: $H_{\mathcal{F}}(Y) = \inf_{f \in \mathcal{F}} \mathbb{E}_{y \sim Y}\left[-\log f[\emptyset](y)\right]$. A technical condition called "optional ignorance" ensures the observer can always choose to ignore the side information $X$.
  2. Predictive $\mathcal{F}$-information: Analogous to Shannon mutual information, predictive $\mathcal{F}$-information measures the reduction in $\mathcal{F}$-entropy when side information $X$ is provided: $I_{\mathcal{F}}(X \rightarrow Y) = H_{\mathcal{F}}(Y) - H_{\mathcal{F}}(Y \mid X)$.
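
To make the definitions concrete, the following is a minimal sketch (ours, not the authors' code) of the two ERM problems for the linear-Gaussian family $f[x] = \mathcal{N}(Wx + b, \frac{1}{2}I)$ discussed below. For this family, $-\log f[x](y) = \|y - (Wx + b)\|^2 + \frac{d}{2}\log \pi$, so each $\mathcal{F}$-entropy is a least-squares problem and the additive constant cancels in the difference:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def f_information_linear_gaussian(X, Y):
    """Plug-in estimate of I_F(X -> Y) for the family f[x] = N(Wx + b, 0.5*I).

    Both F-entropies are computed up to the same additive constant
    (d/2) * log(pi), which cancels in the difference.
    """
    X = np.asarray(X).reshape(len(X), -1)
    Y = np.asarray(Y).reshape(len(Y), -1)

    # Marginal F-entropy: with empty side information the model reduces to
    # a constant mean, and ERM picks the sample mean.
    h_marginal = np.mean(np.sum((Y - Y.mean(axis=0)) ** 2, axis=1))

    # Conditional F-entropy: ERM over (W, b) is ordinary least squares.
    residual = Y - LinearRegression().fit(X, Y).predict(X)
    h_conditional = np.mean(np.sum(residual ** 2, axis=1))

    # Non-negative on the training data: the conditional ERM can always
    # set W = 0 and recover the marginal predictor (optional ignorance).
    return h_marginal - h_conditional

# Toy usage: Y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(1000, 2))
print(f_information_linear_gaussian(X, Y))
```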

Key Properties and Special Cases:

  • Generalization: If $\mathcal{F}$ includes all possible predictive functions, $\mathcal{F}$-information recovers Shannon mutual information.
  • Practical Measures: For specific choices of $\mathcal{F}$:
    • If $Y \in \mathbb{R}^d$ and $\mathcal{F}$ contains only Gaussian predictors $\mathcal{N}(\mu, \frac{1}{2}I)$, then $H_{\mathcal{F}}(Y)$ equals the trace of the covariance matrix of $Y$, up to an additive constant.
    • If $\mathcal{F}$ allows linear regression models $f[x] = \mathcal{N}(Wx + b, \frac{1}{2}I)$, then $I_{\mathcal{F}}(X \rightarrow Y)$ corresponds to an unnormalized coefficient of determination: $R^2$ multiplied by the trace of the covariance of $Y$.
  • Non-Negativity: Like Shannon MI, $I_{\mathcal{F}}(X \rightarrow Y) \ge 0$; this follows from optional ignorance, since the conditional predictor can always fall back to the best marginal one.
  • Data Processing Inequality Violation: Unlike Shannon MI, computation or preprocessing can increase $\mathcal{F}$-information. For a function $t$, it is possible that $I_{\mathcal{F}}(t(X) \rightarrow Y) > I_{\mathcal{F}}(X \rightarrow Y)$ (a worked example follows this list). This aligns with the practice of representation learning, where feature extraction aims to make information more usable for downstream tasks (e.g., prediction with simpler models).
  • Asymmetry: $I_{\mathcal{F}}(X \rightarrow Y)$ is generally not equal to $I_{\mathcal{F}}(Y \rightarrow X)$. This reflects real-world asymmetries, like one-way functions in cryptography or causal relationships (predicting effect from cause vs. cause from effect).
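
To ground the Gaussian special case and the data processing inequality violation, here is a short worked example (ours, consistent with the definitions above but not taken from the paper). For the fixed-variance Gaussian family $f[\cdot] = \mathcal{N}(\mu, \frac{1}{2}I)$ over $Y \in \mathbb{R}^d$, the negative log-likelihood is $-\log f(y) = \|y - \mu\|^2 + \frac{d}{2}\log \pi$, so

$$H_{\mathcal{F}}(Y) = \inf_{\mu} \mathbb{E}\left[\|Y - \mu\|^2\right] + \frac{d}{2}\log \pi = \mathrm{tr}\,\mathrm{Cov}(Y) + \frac{d}{2}\log \pi,$$

with the infimum attained at $\mu = \mathbb{E}[Y]$. Now take $X \sim \mathcal{N}(0, 1)$, $Y = X^2$, and let $\mathcal{F}$ be the linear-Gaussian family $f[x] = \mathcal{N}(wx + b, \frac{1}{2})$. Since $\mathrm{Cov}(X, X^2) = \mathbb{E}[X^3] = 0$, the best linear predictor of $Y$ from $X$ is the constant $\mathbb{E}[Y]$, so

$$I_{\mathcal{F}}(X \to Y) = 0, \qquad \text{yet} \qquad I_{\mathcal{F}}(t(X) \to Y) = \mathrm{Var}(X^2) = 2 \quad \text{for } t(x) = x^2,$$

because $t(X)$ predicts $Y$ perfectly and the conditional $\mathcal{F}$-entropy drops to the constant $\frac{1}{2}\log \pi$. Computation on $X$ has created usable information.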

Estimation and Guarantees:

  • Shannon MI is notoriously hard to estimate reliably from samples, especially in high dimensions. Recent variational estimators (CPC, NWJ, MINE) have limitations like bias or high variance.
  • $\mathcal{F}$-information can be estimated from a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ using empirical risk minimization: $\hat{I}_{\mathcal{F}}(X \to Y; \mathcal{D}) = \inf_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^N \left(-\log f[\emptyset](y_i)\right) - \inf_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^N \left(-\log f[x_i](y_i)\right)$.
  • Crucially, the paper provides PAC-style bounds (Theorem 1) on the estimation error $|I_{\mathcal{F}} - \hat{I}_{\mathcal{F}}|$, relating it to the Rademacher complexity of the function class $\mathcal{G}_{\mathcal{F}} = \{ (x,y) \mapsto \log f[x](y) \mid f \in \mathcal{F} \}$. This means that if $\mathcal{F}$ has bounded complexity (e.g., neural networks with bounded norms or specific architectures), $\mathcal{F}$-information can be estimated reliably; a specific bound is derived for the linear regression case (Corollary 1). A toy illustration of the complexity effect follows this list.
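
As a toy illustration of why the complexity of $\mathcal{F}$ matters (an assumption-laden sketch, not an experiment from the paper): evaluating the plug-in estimate $\hat{I}_{\mathcal{F}}$ on the same data used for ERM overestimates the population value when the family is too rich for the sample size, which is exactly the gap the Rademacher bound controls.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def i_hat(X, Y, degree):
    """Plug-in estimate of I_F(X -> Y) for a polynomial-Gaussian family,
    evaluated on the same data used for ERM (as in the formula above)."""
    Phi = PolynomialFeatures(degree).fit_transform(X)
    h_marginal = np.mean((Y - Y.mean()) ** 2)
    residual = Y - LinearRegression().fit(Phi, Y).predict(Phi)
    return h_marginal - np.mean(residual ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 1))
Y = rng.normal(size=30)  # independent of X, so the population I_F is 0

for degree in (1, 9):
    print(degree, i_hat(X, Y, degree))
# Typically the degree-1 estimate stays near 0 while the degree-9
# estimate is substantially positive: finite-sample overestimation
# driven by the richer family's complexity.
```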

Applications and Experiments:

  1. Structure Learning (Chow-Liu Trees):
    • The standard Chow-Liu algorithm finds the maximum weight spanning tree using Shannon MI as edge weights.
    • The paper proposes using $\mathcal{F}$-information instead. Since it is asymmetric, they find the maximum-weight directed spanning tree (arborescence) using the Chu-Liu/Edmonds algorithm (Algorithm 1); a sketch of this pipeline appears at the end of this section.
    • Theorem 2 provides finite-sample guarantees for this algorithm, showing the weight of the learned tree is close to the optimal tree weight.
    • Experiments on high-dimensional continuous data show the $\mathcal{F}$-information approach significantly outperforms Chow-Liu with state-of-the-art MI estimators (CPC, NWJ, MINE) at recovering the correct tree structure (Figure 1a).
  2. Gene Regulatory Network Inference:
    • Using $\mathcal{F}$-information (with polynomial predictors) as a score for directed edges between genes outperforms various non-parametric MI estimators (KDE, KSG) on the DREAM5 benchmark, achieving higher AUC (Figure 1b). The asymmetry is beneficial here.
  3. Video Frame Ordering:
    • Using a conditional PixelCNN++ as $\mathcal{F}$, the calculated $I_{\mathcal{F}}(X_i \rightarrow X_j)$ decreases with frame distance $|i - j|$ (Figure 1c).
    • The directed tree algorithm successfully recovers the temporal order of frames in Moving-MNIST, even for deterministic dynamics where Shannon MI would fail.
  4. Fair Representation Learning:
    • The paper connects $\mathcal{F}$-information to adversarial fairness methods, arguing they implicitly minimize $I_{\mathcal{F}}(Z \rightarrow U)$, where $Z$ is the representation, $U$ is the sensitive attribute, and $\mathcal{F}$ is related to the discriminator class.
    • Experiments show that fairness learned against one class of adversary ($\mathcal{F}_i$) may not generalize when tested against a different class ($\mathcal{F}_j$), suggesting limitations in the robustness of existing fair representations (Appendix Figure 2b).
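
To sketch how the directed structure-learning pipeline from the first application might look in code (a hypothetical reconstruction: networkx's Edmonds implementation stands in for the paper's Algorithm 1, and the linear-Gaussian family is merely a convenient choice):

```python
import numpy as np
import networkx as nx
from sklearn.linear_model import LinearRegression

def pairwise_f_information(data):
    """data: (n_samples, n_vars). Returns I_hat[i, j], the plug-in estimate
    of I_F(X_i -> X_j) under a linear-Gaussian family. Generally asymmetric."""
    n, d = data.shape
    I_hat = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            y = data[:, j]
            h_marginal = np.mean((y - y.mean()) ** 2)
            pred = LinearRegression().fit(data[:, [i]], y).predict(data[:, [i]])
            I_hat[i, j] = h_marginal - np.mean((y - pred) ** 2)
    return I_hat

def learn_directed_tree(data):
    """Maximum-weight arborescence over the pairwise F-information graph,
    found with networkx's Chu-Liu/Edmonds implementation."""
    I_hat = pairwise_f_information(data)
    d = I_hat.shape[0]
    G = nx.DiGraph()
    G.add_weighted_edges_from(
        (i, j, I_hat[i, j]) for i in range(d) for j in range(d) if i != j
    )
    return nx.maximum_spanning_arborescence(G)

# Toy chain x0 -> x1 -> x2; prints the learned directed tree's edges.
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = x0 + 0.1 * rng.normal(size=2000)
x2 = x1 + 0.1 * rng.normal(size=2000)
print(sorted(learn_directed_tree(np.column_stack([x0, x1, x2])).edges()))
```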

Conclusion:

The paper proposes $\mathcal{F}$-information as a practical alternative to Shannon information when computational constraints are relevant. It captures the notion of "usable" information, exhibits distinctive properties such as asymmetry and violation of the data processing inequality (which justifies representation learning), and, crucially, admits reliable estimation from data with theoretical guarantees. Empirical results across structure learning, gene network inference, video analysis, and fairness demonstrate its practical advantages over methods based on estimating Shannon mutual information.