
A Theory of Usable Information Under Computational Constraints (2002.10689v1)

Published 25 Feb 2020 in cs.LG and stat.ML

Abstract: We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting \emph{predictive $\mathcal{V}$-information} encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, $\mathcal{V}$-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, $\mathcal{V}$-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive $\mathcal{V}$-information is more effective than mutual information for structure learning and fair representation learning.

Citations (157)

Summary

  • The paper introduces predictive $\mathcal{F}$-information to quantify usable information by accounting for an observer's computational limits, relaxing Shannon theory's assumption of unbounded computation.
  • It derives PAC-style estimation bounds and demonstrates the framework's effectiveness in structure learning, gene regulatory network inference, and video frame ordering.
  • Empirical results show that the framework's asymmetry and its violation of the data processing inequality can be exploited in representation learning and fair representation learning.

This paper introduces predictive $\mathcal{F}$-information (the abstract's predictive $\mathcal{V}$-information, written here after the predictive family $\mathcal{F}$), a novel framework for quantifying information that explicitly incorporates the computational limitations and modeling capabilities of an observer. It argues that traditional Shannon information theory, while foundational, is insufficient for many AI and machine learning scenarios because it assumes computationally unbounded observers. For instance, encrypted data has high Shannon mutual information with the original plaintext, but low usable information for an observer without the decryption key or sufficient computational power.

The core idea is to define information relative to a predictive family $\mathcal{F}$, a set of predictive models $f: \mathcal{X} \cup \{\emptyset\} \rightarrow \mathcal{P}(\mathcal{Y})$ that an observer is allowed to use to predict a target variable $Y$ given side information $X$ (or no side information, $\emptyset$).

  1. $\mathcal{F}$-Entropy: The conditional $\mathcal{F}$-entropy $H_{\mathcal{F}}(Y \mid X)$ is defined as the minimum expected negative log-likelihood achievable using models from $\mathcal{F}$ to predict $Y$ given $X$: $H_{\mathcal{F}}(Y \mid X) = \inf_{f \in \mathcal{F}} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right]$. The marginal $\mathcal{F}$-entropy $H_{\mathcal{F}}(Y)$ is defined similarly but without side information: $H_{\mathcal{F}}(Y) = \inf_{f \in \mathcal{F}} \mathbb{E}_{y \sim Y}\left[-\log f[\emptyset](y)\right]$. A technical condition called "optional ignorance" ensures the observer can always choose to ignore the side information $X$.
  2. Predictive $\mathcal{F}$-information: Analogous to Shannon mutual information, predictive $\mathcal{F}$-information measures the reduction in $\mathcal{F}$-entropy when side information $X$ is provided: $I_{\mathcal{F}}(X \rightarrow Y) = H_{\mathcal{F}}(Y) - H_{\mathcal{F}}(Y \mid X)$.
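
To make the definitions concrete, the following is a minimal sketch (ours, not the authors' code) of the two ERM problems for the linear-Gaussian family $f[x] = \mathcal{N}(Wx + b, \frac{1}{2}I)$ discussed below. For this family, $-\log f[x](y) = \|y - (Wx + b)\|^2 + \frac{d}{2}\log \pi$, so each $\mathcal{F}$-entropy is a least-squares problem and the additive constant cancels in the difference:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def f_information_linear_gaussian(X, Y):
    """Plug-in estimate of I_F(X -> Y) for the family f[x] = N(Wx + b, 0.5*I).

    Both F-entropies are computed up to the same additive constant
    (d/2) * log(pi), which cancels in the difference.
    """
    X = np.asarray(X).reshape(len(X), -1)
    Y = np.asarray(Y).reshape(len(Y), -1)

    # Marginal F-entropy: with empty side information the model reduces to
    # a constant mean, and ERM picks the sample mean.
    h_marginal = np.mean(np.sum((Y - Y.mean(axis=0)) ** 2, axis=1))

    # Conditional F-entropy: ERM over (W, b) is ordinary least squares.
    residual = Y - LinearRegression().fit(X, Y).predict(X)
    h_conditional = np.mean(np.sum(residual ** 2, axis=1))

    # Non-negative on the training data: the conditional ERM can always
    # set W = 0 and recover the marginal predictor (optional ignorance).
    return h_marginal - h_conditional

# Toy usage: Y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(1000, 2))
print(f_information_linear_gaussian(X, Y))
```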

Key Properties and Special Cases:

  • Generalization: If $\mathcal{F}$ includes all possible predictive functions, $\mathcal{F}$-information recovers Shannon mutual information.
  • Practical Measures: For specific choices of $\mathcal{F}$:
    • If $Y \in \mathbb{R}^d$ and $\mathcal{F}$ contains only Gaussian predictors $\mathcal{N}(\mu, \frac{1}{2}I)$, then $H_{\mathcal{F}}(Y)$ equals the trace of the covariance matrix of $Y$, up to an additive constant.
    • If $\mathcal{F}$ allows linear regression models $f[x] = \mathcal{N}(Wx + b, \frac{1}{2}I)$, then $I_{\mathcal{F}}(X \rightarrow Y)$ corresponds to an unnormalized coefficient of determination: $R^2$ multiplied by the trace of the covariance of $Y$.
  • Non-Negativity: Like Shannon MI, $I_{\mathcal{F}}(X \rightarrow Y) \ge 0$; this follows from optional ignorance, since the conditional predictor can always fall back to the best marginal one.
  • Data Processing Inequality Violation: Unlike Shannon MI, computation or preprocessing can increase $\mathcal{F}$-information. For a function $t$, it is possible that $I_{\mathcal{F}}(t(X) \rightarrow Y) > I_{\mathcal{F}}(X \rightarrow Y)$ (a worked example follows this list). This aligns with the practice of representation learning, where feature extraction aims to make information more usable for downstream tasks (e.g., prediction with simpler models).
  • Asymmetry: $I_{\mathcal{F}}(X \rightarrow Y)$ is generally not equal to $I_{\mathcal{F}}(Y \rightarrow X)$. This reflects real-world asymmetries, like one-way functions in cryptography or causal relationships (predicting effect from cause vs. cause from effect).
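
To ground the Gaussian special case and the data processing inequality violation, here is a short worked example (ours, consistent with the definitions above but not taken from the paper). For the fixed-variance Gaussian family $f[\cdot] = \mathcal{N}(\mu, \frac{1}{2}I)$ over $Y \in \mathbb{R}^d$, the negative log-likelihood is $-\log f(y) = \|y - \mu\|^2 + \frac{d}{2}\log \pi$, so

$$H_{\mathcal{F}}(Y) = \inf_{\mu} \mathbb{E}\left[\|Y - \mu\|^2\right] + \frac{d}{2}\log \pi = \mathrm{tr}\,\mathrm{Cov}(Y) + \frac{d}{2}\log \pi,$$

with the infimum attained at $\mu = \mathbb{E}[Y]$. Now take $X \sim \mathcal{N}(0, 1)$, $Y = X^2$, and let $\mathcal{F}$ be the linear-Gaussian family $f[x] = \mathcal{N}(wx + b, \frac{1}{2})$. Since $\mathrm{Cov}(X, X^2) = \mathbb{E}[X^3] = 0$, the best linear predictor of $Y$ from $X$ is the constant $\mathbb{E}[Y]$, so

$$I_{\mathcal{F}}(X \to Y) = 0, \qquad \text{yet} \qquad I_{\mathcal{F}}(t(X) \to Y) = \mathrm{Var}(X^2) = 2 \quad \text{for } t(x) = x^2,$$

because $t(X)$ predicts $Y$ perfectly and the conditional $\mathcal{F}$-entropy drops to the constant $\frac{1}{2}\log \pi$. Computation on $X$ has created usable information.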

Estimation and Guarantees:

  • Shannon MI is notoriously hard to estimate reliably from samples, especially in high dimensions. Recent variational estimators (CPC, NWJ, MINE) have limitations like bias or high variance.
  • $\mathcal{F}$-information can be estimated from a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ using empirical risk minimization: $\hat{I}_{\mathcal{F}}(X \to Y; \mathcal{D}) = \inf_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^N \left(-\log f[\emptyset](y_i)\right) - \inf_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^N \left(-\log f[x_i](y_i)\right)$.
  • Crucially, the paper provides PAC-style bounds (Theorem 1) on the estimation error $|I_{\mathcal{F}} - \hat{I}_{\mathcal{F}}|$, relating it to the Rademacher complexity of the function class $\mathcal{G}_{\mathcal{F}} = \{ (x,y) \mapsto \log f[x](y) \mid f \in \mathcal{F} \}$. This means that if $\mathcal{F}$ has bounded complexity (e.g., neural networks with bounded norms or specific architectures), $\mathcal{F}$-information can be estimated reliably; a specific bound is derived for the linear regression case (Corollary 1). A toy illustration of the complexity effect follows this list.
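
As a toy illustration of why the complexity of $\mathcal{F}$ matters (an assumption-laden sketch, not an experiment from the paper): evaluating the plug-in estimate $\hat{I}_{\mathcal{F}}$ on the same data used for ERM overestimates the population value when the family is too rich for the sample size, which is exactly the gap the Rademacher bound controls.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def i_hat(X, Y, degree):
    """Plug-in estimate of I_F(X -> Y) for a polynomial-Gaussian family,
    evaluated on the same data used for ERM (as in the formula above)."""
    Phi = PolynomialFeatures(degree).fit_transform(X)
    h_marginal = np.mean((Y - Y.mean()) ** 2)
    residual = Y - LinearRegression().fit(Phi, Y).predict(Phi)
    return h_marginal - np.mean(residual ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 1))
Y = rng.normal(size=30)  # independent of X, so the population I_F is 0

for degree in (1, 9):
    print(degree, i_hat(X, Y, degree))
# Typically the degree-1 estimate stays near 0 while the degree-9
# estimate is substantially positive: finite-sample overestimation
# driven by the richer family's complexity.
```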

Applications and Experiments:

  1. Structure Learning (Chow-Liu Trees):
    • The standard Chow-Liu algorithm finds the maximum weight spanning tree using Shannon MI as edge weights.
    • The paper proposes using $\mathcal{F}$-information instead. Since it is asymmetric, they find the maximum-weight directed spanning tree (arborescence) using the Chu-Liu/Edmonds algorithm (Algorithm 1); a sketch of this pipeline appears at the end of this section.
    • Theorem 2 provides finite-sample guarantees for this algorithm, showing the weight of the learned tree is close to the optimal tree weight.
    • Experiments on high-dimensional continuous data show the $\mathcal{F}$-information approach significantly outperforms Chow-Liu with state-of-the-art MI estimators (CPC, NWJ, MINE) at recovering the correct tree structure (Figure 1a).
  2. Gene Regulatory Network Inference:
    • Using $\mathcal{F}$-information (with polynomial predictors) as a score for directed edges between genes outperforms various non-parametric MI estimators (KDE, KSG) on the DREAM5 benchmark, achieving higher AUC (Figure 1b). The asymmetry is beneficial here.
  3. Video Frame Ordering:
    • Using a conditional PixelCNN++ as $\mathcal{F}$, the calculated $I_{\mathcal{F}}(X_i \rightarrow X_j)$ decreases with frame distance $|i - j|$ (Figure 1c).
    • The directed tree algorithm successfully recovers the temporal order of frames in Moving-MNIST, even for deterministic dynamics where Shannon MI would fail.
  4. Fair Representation Learning:
    • The paper connects $\mathcal{F}$-information to adversarial fairness methods, arguing they implicitly minimize $I_{\mathcal{F}}(Z \rightarrow U)$, where $Z$ is the representation, $U$ is the sensitive attribute, and $\mathcal{F}$ is related to the discriminator class.
    • Experiments show that fairness learned against one class of adversary ($\mathcal{F}_i$) may not generalize when tested against a different class ($\mathcal{F}_j$), suggesting limitations in the robustness of existing fair representations (Appendix Figure 2b).
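
To sketch how the directed structure-learning pipeline from the first application might look in code (a hypothetical reconstruction: networkx's Edmonds implementation stands in for the paper's Algorithm 1, and the linear-Gaussian family is merely a convenient choice):

```python
import numpy as np
import networkx as nx
from sklearn.linear_model import LinearRegression

def pairwise_f_information(data):
    """data: (n_samples, n_vars). Returns I_hat[i, j], the plug-in estimate
    of I_F(X_i -> X_j) under a linear-Gaussian family. Generally asymmetric."""
    n, d = data.shape
    I_hat = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            y = data[:, j]
            h_marginal = np.mean((y - y.mean()) ** 2)
            pred = LinearRegression().fit(data[:, [i]], y).predict(data[:, [i]])
            I_hat[i, j] = h_marginal - np.mean((y - pred) ** 2)
    return I_hat

def learn_directed_tree(data):
    """Maximum-weight arborescence over the pairwise F-information graph,
    found with networkx's Chu-Liu/Edmonds implementation."""
    I_hat = pairwise_f_information(data)
    d = I_hat.shape[0]
    G = nx.DiGraph()
    G.add_weighted_edges_from(
        (i, j, I_hat[i, j]) for i in range(d) for j in range(d) if i != j
    )
    return nx.maximum_spanning_arborescence(G)

# Toy chain x0 -> x1 -> x2; prints the learned directed tree's edges.
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = x0 + 0.1 * rng.normal(size=2000)
x2 = x1 + 0.1 * rng.normal(size=2000)
print(sorted(learn_directed_tree(np.column_stack([x0, x1, x2])).edges()))
```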

Conclusion:

The paper proposes $\mathcal{F}$-information as a practical alternative to Shannon information when computational constraints are relevant. It captures the notion of "usable" information, exhibits distinctive properties such as asymmetry and violation of the data processing inequality (which justifies representation learning), and, crucially, admits reliable estimation from data with theoretical guarantees. Empirical results across structure learning, gene network inference, video analysis, and fairness demonstrate its practical advantages over methods based on estimating Shannon mutual information.