Information-Theoretic Probing
- Information-theoretic probing is a family of methods that quantifies and interprets the information content in learned representations using mutual information, entropy, and variational bounds.
- The approach is applied in diverse domains such as NLP, cryptography, graph learning, and molecular modeling to rigorously estimate and compare information extraction capabilities.
- Researchers leverage techniques like variational lower bounds, MDL probing, and Bayesian mutual information to assess probe capacity and isolate context-specific information.
Information-theoretic probing comprises a family of methods for quantifying and interpreting the information content and extractability of linguistic, structural, or functional properties in learned representations, typically using tools from information theory: mutual information, entropy, cross-entropy, and related estimators. Across neural language processing, cryptography, graph learning, and molecular modeling, information-theoretic probing operationalizes probing not just as diagnostic classification, but as an explicit, mathematically grounded optimization or estimation of information-theoretic quantities—thereby addressing longstanding ambiguities in probe selection, probe capacity, result interpretation, and comparative evaluation with baselines and controls.
1. Fundamental Principles and Objectives
Information-theoretic probing is anchored in quantifying the mutual information (MI) between a representation $R$ (e.g., a neural embedding) and a property of interest $T$ (e.g., POS tags, graph structures, or molecular configurations). The mutual information, defined as
$$I(T; R) = H(T) - H(T \mid R),$$
captures the reduction in uncertainty about $T$ upon observing $R$. Any deterministic or stochastic mapping of $R$ (including control functions or dimensionality reduction) can only reduce $I(T; R)$ due to the data-processing inequality.
Given the practical intractability of computing the true $I(T; R)$, the central paradigm is to estimate it via lower bounds, operationalized by training a parametric probe $q_\theta(t \mid r)$ (e.g., linear, MLP) to minimize cross-entropy. The key estimator is
$$I(T; R) \;\geq\; H(T) - H_{q_\theta}(T \mid R),$$
where $H_{q_\theta}(T \mid R) = -\mathbb{E}_{p(t,r)}[\log q_\theta(t \mid r)]$ is the probe's cross-entropy; the bound converges to the true $I(T; R)$ as $q_\theta$ approaches the optimal conditional distribution $p(t \mid r)$.
Control baselines, where $R$ is replaced by a type-level or random representation $c(R)$, yield reference bounds $I(T; c(R)) \leq I(T; R)$, enabling quantification of the contextual or additional information specifically encoded beyond shallow baselines (see the sketch below).
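The following sketch illustrates the cross-entropy bound and a control baseline on synthetic data; the logistic-regression probe, the data dimensions, and the `mi_lower_bound_bits` helper are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Synthetic "representations" R (n x d) and a discrete property T (n,) -- illustrative only.
n, d, n_classes = 2000, 32, 5
T = rng.integers(0, n_classes, size=n)
class_means = rng.normal(size=(n_classes, d))
R = class_means[T] + rng.normal(scale=1.5, size=(n, d))   # informative representation
R_control = rng.normal(size=(n, d))                       # control: random representation c(R)

def mi_lower_bound_bits(X, y):
    """H(T) - H_q(T|R): the probe's held-out cross-entropy yields a lower bound on I(T;R), in bits."""
    split = len(y) // 2
    probe = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
    # Plug-in estimate of the label entropy H(T) on the held-out half.
    freqs = np.bincount(y[split:], minlength=n_classes) / (len(y) - split)
    h_t = -np.sum(freqs[freqs > 0] * np.log2(freqs[freqs > 0]))
    # Probe cross-entropy H_q(T|R), converted from nats to bits.
    h_q = log_loss(y[split:], probe.predict_proba(X[split:]), labels=np.arange(n_classes)) / np.log(2)
    return h_t - h_q

bound_r = mi_lower_bound_bits(R, T)
bound_c = mi_lower_bound_bits(R_control, T)
print(f"I(T;R) lower bound:    {bound_r:.3f} bits")
print(f"I(T;c(R)) lower bound: {bound_c:.3f} bits")
print(f"gain over control:     {bound_r - bound_c:.3f} bits")
```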
2. Probing Algorithms and Methodological Frameworks
Several concrete estimators, frameworks, and variants have become central to information-theoretic probing:
2.1 Variational Lower Bounds (MINE, InfoNCE)
- The Donsker–Varadhan bound and its empirical neural estimator (MINE) underpin the variational lower bounds:
  $$I(X; Y) \;\geq\; \sup_{\theta}\; \mathbb{E}_{p(x, y)}\big[f_\theta(x, y)\big] - \log \mathbb{E}_{p(x)\,p(y)}\big[e^{f_\theta(x, y)}\big],$$
  where $f_\theta$ is a learned critic network.
- In classification probing, standard cross-entropy minimization with a softmax output is equivalent to maximizing a variational lower bound on $I(Y; R)$ for label $Y$ and representation $R$ (a MINE sketch follows below).
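A minimal PyTorch sketch of a MINE-style estimator under the Donsker–Varadhan bound; the toy data, critic architecture, and training schedule are illustrative assumptions rather than the published recipe.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy correlated pair (X, Y): Y = X + noise, so I(X; Y) > 0.  Illustrative data only.
n, dim = 4096, 8
X = torch.randn(n, dim)
Y = X + 0.5 * torch.randn(n, dim)

# Critic f_theta(x, y): a small MLP over the concatenated pair.
critic = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(500):
    # Joint samples use aligned pairs; shuffling Y simulates samples from the product of marginals.
    perm = torch.randperm(n)
    joint_scores = critic(torch.cat([X, Y], dim=1))
    marginal_scores = critic(torch.cat([X, Y[perm]], dim=1))
    # Donsker-Varadhan lower bound: E_joint[f] - log E_marginal[exp(f)].
    dv_bound = joint_scores.mean() - (torch.logsumexp(marginal_scores, dim=0) - math.log(n))
    loss = -dv_bound          # maximize the bound by gradient ascent
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"estimated MI lower bound: {dv_bound.item():.3f} nats")
```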
2.2 Probe Selection and Control Mechanisms
- Selecting a higher-capacity probe (e.g., deeper MLPs, fine-tuning) always tightens the lower bound on MI by reducing the gap $\mathbb{E}_{p(r)}\big[\mathrm{KL}\big(p(t \mid r)\,\|\,q_\theta(t \mid r)\big)\big]$.
- Constraining probe capacity (e.g., to linear functions) yields systematically looser, and sometimes misleading, lower bounds.
- Control tasks and control functions—randomizing the target labels or representation—provide baselines to isolate the probe's capacity from the actual information encoded (Zhu et al., 2020). Both approaches are algebraically equivalent up to task-dependent constants under identical probe-training regimes.
2.3 Conditional Probing and Usable Information
- Conditional probing extends "baselined" approaches, directly estimating the information about a property $T$ in $R$ (the target representation) that is not explainable by a baseline $B$ (e.g., non-contextual embeddings) (Hewitt et al., 2021).
- The core estimator is the conditional $\mathcal{V}$-information:
  $$I_{\mathcal{V}}(R \to T \mid B) \;=\; H_{\mathcal{V}}(T \mid B) - H_{\mathcal{V}}(T \mid B, R),$$
  which is robustly computable by contrasting probe performance with and without $R$, given $B$ (a sketch follows below).
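A sketch of the conditional $\mathcal{V}$-information recipe: one probe sees only the baseline $B$, another sees $[B; R]$, and the difference of their held-out cross-entropies estimates $I_{\mathcal{V}}(R \to T \mid B)$. The synthetic data and the logistic-regression probe family are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)

# T depends partly on the baseline B and partly on extra signal that only R carries.
n, n_classes = 3000, 4
B = rng.normal(size=(n, 16))                     # baseline representation (e.g., type-level vectors)
extra = rng.normal(size=(n, 16))                 # context-specific signal
R = np.concatenate([B + 0.5 * rng.normal(size=B.shape), extra], axis=1)
T = (B[:, :n_classes] + 2.0 * extra[:, :n_classes]).argmax(axis=1)

def probe_cross_entropy_bits(X, y):
    """H_V(T | X): held-out cross-entropy of the best probe found in the family V (here logistic regression)."""
    split = n // 2
    probe = LogisticRegression(max_iter=2000).fit(X[:split], y[:split])
    return log_loss(y[split:], probe.predict_proba(X[split:]), labels=np.arange(n_classes)) / np.log(2)

h_given_b = probe_cross_entropy_bits(B, T)                                 # H_V(T | B)
h_given_br = probe_cross_entropy_bits(np.concatenate([B, R], axis=1), T)   # H_V(T | B, R)
print(f"I_V(R -> T | B) ≈ {h_given_b - h_given_br:.3f} bits")
```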
2.4 Minimum Description Length (MDL) Probing
- MDL probing recasts probing as a compression problem: how many bits are needed to transmit the labels given the representations, using the best available probe as the compression model (an online-coding sketch follows after this list).
- Both variational and online coding schemes result in MDL bounds that subsume cross-entropy losses but penalize model complexity and data usage directly (Voita et al., 2020).
- MDL-based ranks and comparisons are stable across probe hyperparameters, seeds, and control tasks.
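A minimal sketch of the online (prequential) coding variant of MDL probing: the first block of labels is sent with a uniform code, and each subsequent block is encoded by a probe trained only on the data that precedes it, so the total codelength penalizes both poor fit and slow learning. The block fractions, probe family, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

n, d, n_classes = 4000, 32, 5
T = rng.integers(0, n_classes, size=n)
R = rng.normal(size=(n_classes, d))[T] + rng.normal(scale=2.0, size=(n, d))

def block_codelength_bits(probe, X_block, y_block):
    """Bits needed to transmit y_block under the probe's predictive distribution (clipped for stability)."""
    proba = np.full((len(y_block), n_classes), 1e-12)
    proba[:, probe.classes_] = probe.predict_proba(X_block)   # probe may not have seen every class yet
    return float(-np.log2(proba[np.arange(len(y_block)), y_block]).sum())

def online_codelength_bits(X, y, fractions=(0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.0)):
    bounds = [int(f * len(y)) for f in fractions]
    total = bounds[0] * np.log2(n_classes)                    # uniform code for the first block
    for start, end in zip(bounds[:-1], bounds[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        total += block_codelength_bits(probe, X[start:end], y[start:end])
    return total

mdl = online_codelength_bits(R, T)
uniform = n * np.log2(n_classes)
print(f"online codelength: {mdl:.0f} bits (uniform code: {uniform:.0f} bits, ratio {uniform / mdl:.2f}x)")
```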
2.5 Bayesian Mutual Information Probing
- Bayesian MI is defined as $I_{\mathcal{D}}(T; R) = H_{\mathcal{D}}(T) - H_{\mathcal{D}}(T \mid R)$, where all distributions are the posterior predictive distributions of a Bayesian agent after observing a dataset $\mathcal{D}$ (Pimentel et al., 2021).
- Unlike classical MI, Bayesian MI is data-dependent, may increase as data grows, and can reflect "ease of extraction" of information by realistically limited agents.
- Probing learning curves, as a function of the dataset size $|\mathcal{D}|$, quantify which representations enable rapid information extraction with finite data (a learning-curve sketch follows below).
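A sketch of the learning-curve view of extractability: the same cross-entropy lower bound is traced as a function of the number of probe-training examples, so two representations carrying similar total information can be separated by how quickly that information becomes usable. The two synthetic representations and the sample sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(3)

n, n_classes, dim = 6000, 5, 16
T = rng.integers(0, n_classes, size=n)
means = rng.normal(size=(n_classes, dim))
R_easy = means[T] + 0.5 * rng.normal(size=(n, dim))       # signal is linearly exposed
R_hard = np.tanh((means[T] + 0.5 * rng.normal(size=(n, dim))) @ rng.normal(size=(dim, dim)))  # same signal, scrambled

held_out = slice(n // 2, n)
freqs = np.bincount(T[held_out], minlength=n_classes) / (n - n // 2)
h_t = -np.sum(freqs * np.log2(freqs + 1e-12))             # plug-in label entropy on held-out data

def bound_at(X, m):
    """MI lower bound (bits) when the probe is trained on only the first m examples."""
    probe = LogisticRegression(max_iter=2000).fit(X[:m], T[:m])
    h_q = log_loss(T[held_out], probe.predict_proba(X[held_out]), labels=np.arange(n_classes)) / np.log(2)
    return h_t - h_q

for m in (250, 500, 1000, 3000):
    print(f"m={m:5d}   easy: {bound_at(R_easy, m):.3f} bits   hard: {bound_at(R_hard, m):.3f} bits")
```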
3. Empirical Findings and Application Domains
The information-theoretic probing paradigm has been applied across multiple domains:
3.1 Linguistic Structure in NLP
- For tasks such as POS tagging and dependency labeling (Universal Dependencies 2.5) with BERT representations:
- Unconditional entropy $H(T)$ is up to $3.6$ bits (POS) and $3.6$–$4.5$ bits (dependency labels).
- BERT reduces the conditional entropy $H(T \mid R)$ to $0.10$–$0.76$ bits (POS), but word-type baselines (fastText, one-hot) already attain conditional entropies no higher than $0.90$ bits.
- Contextual-embedding gains above type-level baselines are up to $0.27$ bits (POS) and up to $0.55$ bits (dependencies), i.e., $\leq 12\%$ of $H(T)$ (see the worked ratios after this list).
- Most part-of-speech information is captured by word identity alone; context adds little. Dependency labeling sees moderately higher gains, but still limited (Pimentel et al., 2020).
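- For orientation, taking the percentage relative to the unconditional entropy (an interpretive assumption): $0.55 / 4.5 \approx 0.12$ for dependency labels and $0.27 / 3.6 \approx 0.075$ for POS, so the contextual gain is at most roughly a tenth of $H(T)$.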
3.2 Structural Graph Probing
- The "Bird's Eye" and "Worm's Eye" mutual-information probes estimate how much of a linguistic graph (e.g., dependency parse or AMR) is encoded by sentence representations.
- MI is estimated via a variational lower bound; information contributed by specific graph substructures is localized by controlled perturbation (MIL analysis) (Hou et al., 2021).
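A compact sketch of the perturbation idea on a toy graph-probing setup: a binary edge probe gives a variational lower bound on the mutual information between word vectors and the graph, and renoising one word slot localizes how much of that information the slot carried. The toy edge rule, data sizes, and helper names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(4)

# Toy setup: each "sentence" is 8 word vectors; an edge (i, j) exists when the vectors are similar,
# so the word vectors fully determine the graph.  Everything here is illustrative.
n_sents, length, dim = 1500, 8, 12
words = rng.normal(size=(n_sents, length, dim))
pairs = [(i, j) for i in range(length) for j in range(i + 1, length)]
y = np.array([int(words[s, i] @ words[s, j] > 1.0) for s in range(n_sents) for (i, j) in pairs])

def build_features(w):
    return np.array([np.concatenate([w[s, i], w[s, j]]) for s in range(n_sents) for (i, j) in pairs])

def edge_mi_bound_bits(X, y):
    """H(edge) - H_q(edge | representations): variational lower bound on the graph MI, in bits per pair."""
    split = len(y) // 2
    probe = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
    p1 = y[split:].mean()
    h_t = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))
    h_q = log_loss(y[split:], probe.predict_proba(X[split:]), labels=[0, 1]) / np.log(2)
    return h_t - h_q

full_bound = edge_mi_bound_bits(build_features(words), y)

# Controlled perturbation: replace one word slot's vectors with noise and measure the drop in the
# bound; the difference localizes how much graph information that substructure carried.
perturbed = words.copy()
perturbed[:, 0, :] = rng.normal(size=(n_sents, dim))
perturbed_bound = edge_mi_bound_bits(build_features(perturbed), y)

print(f"MI bound, full representations:   {full_bound:.3f} bits per pair")
print(f"MI bound, slot 0 renoised:        {perturbed_bound:.3f} bits per pair")
print(f"localized contribution of slot 0: {full_bound - perturbed_bound:.3f} bits per pair")
```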
3.3 Cryptography and Masking Schemes
- In white-box cryptographic circuits, probing security of order $d$ (PS($d$)) is defined via mutual information: $I(X; W_{i_1}, \dots, W_{i_d}) = 0$ for any choice of $d$ probed wires $W_{i_1}, \dots, W_{i_d}$, where $X$ denotes the protected secret.
- The cardinal result is that, for a linear code, PS($d$) is satisfied if and only if any $d$ columns of the probing matrix are linearly independent over the underlying field $\mathbb{F}$ (0907.4273); a small independence check over $\mathbb{F}_2$ is sketched after this list.
- There is a duality between information-theoretic probing security and error-detecting codes (fault security); optimal tamper-resistant codes simultaneously ensure both privacy and integrity via code design.
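A small sketch of the column-independence criterion, checked over $\mathbb{F}_2$ by brute force; the example matrix is arbitrary and not taken from the cited construction.

```python
import itertools
import numpy as np

def rank_gf2(M):
    """Rank of a binary matrix over GF(2), by Gaussian elimination with XOR row updates."""
    M = (M % 2).astype(np.uint8)
    rank = 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]        # move the pivot row into place
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]                    # eliminate the column everywhere else
        rank += 1
    return rank

def probing_secure(G, d):
    """PS(d) per the stated criterion: every choice of d columns of the probing matrix is linearly independent."""
    return all(rank_gf2(G[:, list(cols)]) == d
               for cols in itertools.combinations(range(G.shape[1]), d))

# Arbitrary example matrix of a small linear masking scheme.
G = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=np.uint8)

for d in (1, 2, 3):
    print(f"PS({d}): {probing_secure(G, d)}")
```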
3.4 Molecular and RNA Structural Biology
- Probing RNA secondary structures via adaptive, entropy-maximizing queries ("ensemble tree" construction) based on binary bit-queries (e.g., is a base pair present?) enables localization of a high-probability structure within the Boltzmann ensemble with >90% accuracy in ~11 queries (Li et al., 2019).
- The information-theoretic query selection maximally reduces ensemble entropy at each step (up to 1 bit per query), strictly outperforming classical SHAPE-based folding protocols.
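A greedy sketch of the query-selection principle on a toy Boltzmann ensemble: at each step, ask about the base pair whose presence is most uncertain under the current ensemble (probability closest to $0.5$, hence close to 1 bit of expected information), then condition the ensemble on the answer. The ensemble, weights, and query pool are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy Boltzmann ensemble: 64 candidate structures over 20 possible base pairs,
# each structure a 0/1 membership vector with a Boltzmann-style weight.
n_structs, n_pairs = 64, 20
structures = rng.integers(0, 2, size=(n_structs, n_pairs))
weights = np.exp(rng.normal(size=n_structs))
probs = weights / weights.sum()
true_structure = structures[rng.integers(n_structs)]       # the structure we try to localize

def ensemble_entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(f"initial ensemble entropy: {ensemble_entropy(probs):.2f} bits")

for step in range(1, 12):
    pair_probs = probs @ structures                        # P(pair present) under the current ensemble
    q = int(np.argmin(np.abs(pair_probs - 0.5)))           # most informative yes/no query
    answer = int(true_structure[q])
    probs = np.where(structures[:, q] == answer, probs, 0.0)   # condition on the answer
    probs /= probs.sum()
    print(f"query {step:2d}: pair {q:2d} present? {bool(answer)}   entropy -> {ensemble_entropy(probs):.2f} bits")
    if ensemble_entropy(probs) == 0.0:
        break
```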
4. Theoretical Properties and Guarantees
Information-theoretic probing frameworks possess several robust, theoretically grounded properties:
- Lower-bounding and Tightness: Variational or cross-entropy-based estimators provide a lower bound on true MI, with the bound tightened by increasing probe capacity. There is no systematic penalty for using high-capacity probes; restricting probe complexity can substantially underestimate encoded information (Pimentel et al., 2020).
- Conditional MI and Usable Information: Conditioning on baseline representations or prior knowledge isolates the additional information present, discounting the confound of "trivial" features (Hewitt et al., 2021). The conditional $\mathcal{V}$-information estimate is nonnegative, non-decreasing in probe capacity, and vanishes when the property is conditionally independent of the representation given the baseline.
- Stability and Model Selection: MDL and conditional/usable-information-based probes offer rankings and results that are robust to probe class, parameterization, and random seed. Probe selection should prioritize stable discrimination over controls or baselines, emphasizing data efficiency and inductive bias.
- Practicality: Direct estimation of $I(T; R)$ is infeasible for high-dimensional neural representations; all methodologies rely on proxies, variational bounds, or data-adaptive compression.
5. Limitations, Controversies, and Best Practices
Several foundational limitations and nuances arise:
- Probing Reveals Language, Not Representation: Under (approximate) invertibility of the encoder, $I(T; R) = I(T; S)$ for the underlying sentence $S$; probing thus reveals the information content of the language rather than uniquely that of the representation (Pimentel et al., 2020), making strong claims about what the representation has learned delicate.
- Unidentifiability of Cause: High probe performance can reflect either true information in the representation or capacity of the probe to "learn the task"—a canonical dichotomy (Zhu et al., 2020).
- Classical MI May Obscure Extraction Difficulty: Classical mutual information does not reflect the "ease of extraction" for finite-data, realistically constrained agents; Bayesian MI and MDL provide alternatives that directly measure this aspect (Voita et al., 2020; Pimentel et al., 2021).
- Probe Family Dependence: Estimates of $\mathcal{V}$-information or MDL are contingent on the expressivity of the probe family $\mathcal{V}$ or compression model; the choice of $\mathcal{V}$ is both a necessity and a potential source of arbitrariness (Hewitt et al., 2021).
- Empirical Instabilities: Small sample sizes, overfitting, or poorly tuned baselines can distort probe-based MI estimates; recommendations include early stopping, weight decay, fixed randomizations for control settings, and comprehensive cross-validation (Zhu et al., 2020).
Best practices derived across these studies mandate:
- Always select the highest-capacity probe feasible, but report results on controls and with baselines.
- Use information-gain or conditional information (over a baseline) to specifically quantify contextual or structure-specific encoding.
- Combine MI-based metrics, MDL, and learning curves to triangulate real differences between representations.
- Where feasible, report both raw (cross-entropy, MI bound) and normalized (e.g., gains over random or type-level baselines) results.
6. Future Directions
Research trajectories in information-theoretic probing include:
- Formalizing "ease of extraction" using Bayesian, MDL, or learning-curve-based information measures to reflect realistic downstream usability.
- Extending conditional/usable-information probing to support multiple baselines, hierarchical and structured outputs, and multimodal learning settings.
- Developing richer probe families (e.g., kernels, attention mechanisms) while maintaining interpretability and computational tractability.
- Analyses linking representation quality to geometric or margin-based criteria, and quantifying the relationship between separability and information content (Choi et al., 2023).
- Transferring concepts to non-linguistic structured domains (e.g., molecules, images, time series) and security applications (e.g., optimal masking schemes in cryptographic hardware).
Information-theoretic probing thus unifies interpretability, robustness, and practicality in the analysis of learned representations, providing quantifiable, comparable, and theoretically justified measures for properties previously approached heuristically.