- The paper introduces predictive F-information to quantify usable information by accounting for computational limits, challenging traditional Shannon theory.
- It derives PAC-style estimation bounds and demonstrates the framework’s effectiveness in structure learning, gene regulatory network inference, and video frame ordering.
- Empirical results show that the framework's asymmetry, and the fact that preprocessing can increase usable information (a violation of the data processing inequality), translate into practical gains for representation learning and for analyzing fairness in AI.
This paper introduces predictive F-information, a novel framework for quantifying information that explicitly incorporates the computational limitations and modeling capabilities of an observer. It argues that traditional Shannon information theory, while foundational, is insufficient for many AI and machine learning scenarios because it assumes computationally unbounded observers. For instance, encrypted data has high Shannon mutual information with the original plaintext, but low usable information for an observer without the decryption key or sufficient computational power.
The core idea is to define information relative to a predictive family $\mathcal{F}$, a set of predictive models $f : \mathcal{X} \cup \{\varnothing\} \to \mathcal{P}(\mathcal{Y})$ that an observer is allowed to use to predict a target variable $Y$ given side information $X$ (or no side information, denoted $\varnothing$).
- F-Entropy: The conditional F-entropy $H_{\mathcal{F}}(Y \mid X)$ is defined as the minimum expected negative log-likelihood achievable using models from $\mathcal{F}$ to predict $Y$ given $X$:
$$H_{\mathcal{F}}(Y \mid X) = \inf_{f \in \mathcal{F}} \mathbb{E}_{x, y \sim X, Y}\big[-\log f[x](y)\big].$$
The marginal F-entropy is defined similarly but without side information: $H_{\mathcal{F}}(Y) = \inf_{f \in \mathcal{F}} \mathbb{E}_{y \sim Y}\big[-\log f[\varnothing](y)\big]$. A technical condition called "optional ignorance" ensures the observer can always choose to ignore the side information $X$.
- Predictive F-information: Analogous to Shannon mutual information, predictive F-information measures the reduction in F-entropy when side information $X$ is provided:
$$I_{\mathcal{F}}(X \to Y) = H_{\mathcal{F}}(Y) - H_{\mathcal{F}}(Y \mid X).$$
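As a concrete worked instance of these definitions (a sketch, assuming the fixed-covariance Gaussian family $\{\mathcal{N}(\mu, \tfrac{1}{2}I)\}$ that appears in the special cases below), the marginal F-entropy reduces to the total variance of $Y$ up to an additive constant:

```latex
\begin{aligned}
-\log f_{\mu}(y) &= \lVert y-\mu\rVert^{2} + \tfrac{d}{2}\log\pi ,
  \qquad f_{\mu} = \mathcal{N}\!\big(\mu, \tfrac{1}{2}I\big),\ y \in \mathbb{R}^{d}, \\
H_{\mathcal{F}}(Y)
  &= \inf_{\mu}\ \mathbb{E}\big[\lVert Y-\mu\rVert^{2}\big] + \tfrac{d}{2}\log\pi
   = \operatorname{tr}\!\big(\operatorname{Cov}(Y)\big) + \tfrac{d}{2}\log\pi ,
\end{aligned}
```

since the infimum is attained at $\mu^{*} = \mathbb{E}[Y]$. Up to the additive constant, the marginal F-entropy of this family is simply the total variance of $Y$, which is the sense in which it is "related to the trace of the covariance matrix" below.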
Key Properties and Special Cases:
- Generalization: If F includes all possible predictive functions, F-information recovers Shannon mutual information.
- Practical Measures: For specific choices of F:
- If $Y \in \mathbb{R}^{d}$ and $\mathcal{F}$ contains only Gaussian predictors $\mathcal{N}(\mu, \tfrac{1}{2}I)$, then $H_{\mathcal{F}}(Y)$ is related to the trace of the covariance matrix of $Y$ (equal to it up to an additive constant, as sketched above).
- If $\mathcal{F}$ consists of linear regression models $f[x] = \mathcal{N}(Wx + b, \tfrac{1}{2}I)$, then $I_{\mathcal{F}}(X \to Y)$ corresponds to the (unnormalized) coefficient of determination ($R^{2}$) multiplied by the trace of the covariance of $Y$, i.e., the explained variance.
- Non-Negativity: Like Shannon MI, $I_{\mathcal{F}}(X \to Y) \ge 0$; the optional ignorance condition guarantees this, since an observer given $X$ can always do at least as well as one who ignores it.
- Data Processing Inequality Violation: Unlike Shannon MI, computation or preprocessing can increase F-information: for a function $t$, it is possible that $I_{\mathcal{F}}(t(X) \to Y) > I_{\mathcal{F}}(X \to Y)$. This aligns with the practice of representation learning, where feature extraction aims to make information more usable for downstream tasks (e.g., prediction with simpler models); a small numerical sketch follows this list.
- Asymmetry: $I_{\mathcal{F}}(X \to Y)$ is generally not equal to $I_{\mathcal{F}}(Y \to X)$. This reflects real-world asymmetries, such as one-way functions in cryptography or causal relationships (predicting effect from cause vs. cause from effect).
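To make the data-processing point concrete, here is a minimal numerical sketch (not from the paper) using the linear-Gaussian family above, for which the empirical F-information reduces to the explained variance of $Y$. With $Y = X^{2}$ plus noise, a linear predictor extracts almost no usable information from $X$, while the preprocessed feature $t(X) = X^{2}$ makes the same information usable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = x**2 + 0.1 * rng.normal(size=n)      # strong but nonlinear dependence on x

def linear_f_information(x, y):
    """Empirical I_F(X -> Y) for the linear-Gaussian family f[x] = N(w*x + b, I/2):
    total variance of y minus the residual variance of the best linear fit."""
    X = np.column_stack([x, np.ones_like(x)])       # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # ERM over the linear family
    return y.var() - (y - X @ coef).var()           # = R^2 * Var(y) >= 0

print(linear_f_information(x, y))      # ~ 0: x carries little *usable* information
print(linear_f_information(x**2, y))   # large: I_F(t(X) -> Y) > I_F(X -> Y)
```

Shannon mutual information could never increase this way, since $I(t(X); Y) \le I(X; Y)$ for any deterministic $t$.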
Estimation and Guarantees:
- Shannon MI is notoriously hard to estimate reliably from samples, especially in high dimensions. Recent variational estimators (CPC, NWJ, MINE) have limitations like bias or high variance.
- F-information can be estimated from data using empirical risk minimization:
$$\hat{I}_{\mathcal{F}}(X \to Y; \mathcal{D}) = \inf_{f \in \mathcal{F}} \frac{1}{N} \sum_{i} \big(-\log f[\varnothing](y_i)\big) \;-\; \inf_{f \in \mathcal{F}} \frac{1}{N} \sum_{i} \big(-\log f[x_i](y_i)\big).$$
- Crucially, the paper provides PAC-style bounds (Theorem 1) on the estimation error $\lvert I_{\mathcal{F}} - \hat{I}_{\mathcal{F}} \rvert$, relating it to the Rademacher complexity of the function class $\mathcal{G}_{\mathcal{F}} = \{(x, y) \mapsto \log f[x](y) \mid f \in \mathcal{F}\}$. This means that if $\mathcal{F}$ has bounded complexity (e.g., neural networks with bounded norms or specific architectures), F-information can be estimated reliably. A specific bound is derived for the linear regression case (Corollary 1).
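A minimal sketch of this plug-in estimator (illustrative, not the paper's code), instantiated for a discrete target with a multinomial logistic-regression predictive family; the function name, the scikit-learn model choice, and the synthetic data are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def empirical_f_information(x, y, conditional_model):
    """Plug-in estimate of I_F(X -> Y): marginal ERM term minus conditional ERM term."""
    # Marginal term: with optional ignorance, the best "ignore X" predictor in a
    # softmax family is the constant distribution matching the label frequencies,
    # so the minimized average NLL is the empirical entropy of y.
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    h_marginal = float(-(p * np.log(p)).sum())
    # Conditional term: empirical risk minimization over the chosen family.
    proba = conditional_model.fit(x, y).predict_proba(x)
    h_conditional = log_loss(y, proba, labels=classes)
    return h_marginal - h_conditional

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 2))
y = (x[:, 0] + 0.3 * rng.normal(size=5000) > 0).astype(int)   # y depends on x[:, 0]
print(empirical_f_information(x, y, LogisticRegression()))
```

The in-sample fit mirrors the ERM form of the estimator above; the Rademacher-complexity bound is what controls the gap between this empirical quantity and its population counterpart.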
Applications and Experiments:
- Structure Learning (Chow-Liu Trees):
- The standard Chow-Liu algorithm finds the maximum weight spanning tree using Shannon MI as edge weights.
- The paper proposes using F-information instead. Since F-information is asymmetric, the algorithm finds the maximum-weight directed spanning tree (arborescence) using the Chu-Liu/Edmonds algorithm (Algorithm 1); see the sketch after this list.
- Theorem 2 provides finite-sample guarantees for this algorithm, showing the weight of the learned tree is close to the optimal tree weight.
- Experiments on high-dimensional continuous data show the F-information approach significantly outperforms Chow-Liu with state-of-the-art MI estimators (CPC, NWJ, MINE) in terms of recovering the correct tree structure (Figure 1a).
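A minimal sketch of this directed structure-learning step (illustrative assumptions: the chain-structured synthetic data, the linear-Gaussian edge scores, and networkx's `maximum_spanning_arborescence` — its Chu-Liu/Edmonds implementation — standing in for Algorithm 1):

```python
import numpy as np
import networkx as nx

def linear_f_information(x, y):
    """Edge weight I_F(X -> Y) under a linear-Gaussian family (explained variance of y)."""
    X = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y.var() - (y - X @ coef).var()

# Synthetic chain X_0 -> X_1 -> ... -> X_4.
rng = np.random.default_rng(0)
n, d = 2000, 5
data = np.zeros((n, d))
data[:, 0] = rng.normal(size=n)
for j in range(1, d):
    data[:, j] = 0.9 * data[:, j - 1] + 0.5 * rng.normal(size=n)

# Complete directed graph with I_F(X_i -> X_j) as the weight of edge i -> j.
g = nx.DiGraph()
for i in range(d):
    for j in range(d):
        if i != j:
            g.add_edge(i, j, weight=linear_f_information(data[:, i], data[:, j]))

# Maximum-weight arborescence via Chu-Liu/Edmonds (uses the "weight" attribute by default).
tree = nx.maximum_spanning_arborescence(g)
print(sorted(tree.edges()))   # learned directed dependency structure
```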
- Gene Regulatory Network Inference:
- Using F-information (with polynomial predictors) as a score for directed edges between genes outperforms various non-parametric MI estimators (KDE, KSG) on the DREAM5 benchmark, achieving higher AUC (Figure 1b). The asymmetry of the score is beneficial here; a minimal scoring sketch follows.
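A compact sketch of such a directed edge score (names and the degree-3 choice are illustrative assumptions; the paper's exact polynomial family may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def polynomial_f_information(x, y, degree=3):
    """Directed edge score I_F(x -> y) with F = fixed-variance Gaussian predictors
    whose mean is a polynomial of x (i.e., the explained variance of y)."""
    feats = PolynomialFeatures(degree).fit_transform(x.reshape(-1, 1))
    pred = LinearRegression().fit(feats, y).predict(feats)
    return y.var() - (y - pred).var()

# Score every candidate regulator -> target pair on an expression matrix `expr`
# of shape (samples, genes); the asymmetric scores are then ranked and evaluated
# by AUC against the reference network:
# scores[i, j] = polynomial_f_information(expr[:, i], expr[:, j])
```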
- Video Frame Ordering:
- Using a conditional PixelCNN++ as the predictive family $\mathcal{F}$, the estimated $I_{\mathcal{F}}(X_i \to X_j)$ decreases with frame distance $\lvert i - j \rvert$ (Figure 1c).
- The directed tree algorithm successfully recovers the temporal order of frames in Moving-MNIST, even for deterministic dynamics where Shannon MI would fail.
- Fair Representation Learning:
- The paper connects F-information to adversarial fairness methods, arguing that they implicitly minimize $I_{\mathcal{F}}(Z \to U)$, where $Z$ is the representation, $U$ is the sensitive attribute, and $\mathcal{F}$ is related to the discriminator class.
- Experiments show that fairness learned against one class of adversary ($\mathcal{F}_i$) may not generalize when tested against a different class ($\mathcal{F}_j$), suggesting limitations in the robustness of existing fair representations (Appendix Figure 2b).
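A small audit sketch of this point (an illustration under assumed adversary classes, not the paper's experiment): estimate $I_{\mathcal{F}}(Z \to U)$ for two different adversary families and compare; a representation that looks fair to a linear adversary can still leak $U$ to a more expressive one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

def adversarial_f_information(z, u, adversary):
    """I_F(Z -> U) with F given by the adversary class: marginal entropy of u
    minus the adversary's achieved cross-entropy when predicting u from z."""
    classes, counts = np.unique(u, return_counts=True)
    p = counts / counts.sum()
    h_marginal = float(-(p * np.log(p)).sum())
    proba = adversary.fit(z, u).predict_proba(z)
    return h_marginal - log_loss(u, proba, labels=classes)

# z: learned representation, u: sensitive attribute (arrays of matching length).
# print(adversarial_f_information(z, u, LogisticRegression()))                          # F_i: linear
# print(adversarial_f_information(z, u, MLPClassifier(hidden_layer_sizes=(64,),
#                                                     max_iter=500)))                   # F_j: MLP
```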
Conclusion:
The paper proposes F-information as a practical alternative to Shannon information when computational constraints are relevant. It captures the notion of "usable" information, exhibits distinct properties like violating the data processing inequality (justifying representation learning) and asymmetry, and crucially, allows for reliable estimation from data with theoretical guarantees. Empirical results across structure learning, gene network inference, video analysis, and fairness demonstrate its practical advantages over methods based on estimating Shannon mutual information.