
Scalable Mutual Information Estimation

Updated 20 April 2026
  • Scalable mutual information estimation is a collection of methods that measure dependency between variables using neural, nonparametric, and projection-based techniques.
  • Key techniques include variational neural estimation (MINE), local nonuniformity correction in kNN-based estimators, and hashing or flow-based methods for linear-time performance.
  • These approaches enhance high-dimensional data analysis in applications like generative modeling, dependency testing, and scientific computing by balancing accuracy with computational efficiency.

Scalable mutual information estimation refers to the family of methodologies and algorithms for estimating mutual information (MI) between random variables or data streams in regimes where classical approaches are computationally or statistically inefficient—especially in high-dimensional, large-sample, or high-MI settings. This field sits at the intersection of information theory, statistics, machine learning, and scientific computing, with substantial impact on generative modeling, dependency analysis, neural representation learning, and large-scale data mining.

1. Variational and Neural Estimation Approaches

Mutual Information Neural Estimation (MINE) introduced variational bounds on mutual information based on deep neural networks, leveraging the Donsker–Varadhan (DV) dual representation of the KL divergence. Given the joint distribution $P_{XZ}$ and the product of marginals $P_X \otimes P_Z$, the DV representation yields

$$I(X;Z) = D_{KL}(P_{XZ} \,\|\, P_X \otimes P_Z) = \sup_T \left\{ \mathbb{E}_{P_{XZ}}[T] - \log \mathbb{E}_{P_X \otimes P_Z}\!\left[e^T\right] \right\}.$$

Restricting $T$ to a neural network parameterization $T_\theta(x,z)$ defines the MINE objective, which is optimized using stochastic gradient ascent with samples from the joint and shuffled marginals. Critical to scalability, the computational cost is $O(\text{batch size} \times \text{network size})$ per iteration, scaling linearly in data dimension and sample size. Strong consistency and sample-complexity bounds are established under boundedness and Lipschitz conditions on $T_\theta$ (Belghazi et al., 2018).

In practice, MINE debiases the stochastic gradient, whose denominator $\mathbb{E}[e^{T_\theta}]$ is poorly estimated on small batches, via an exponential moving average; limitations in tightness or stability are mitigated through careful critic parameterization or gradient clipping. MINE's framework underlies scalable applications such as robust generative adversarial networks, improved inference in bidirectional models (e.g., ALI/BiGAN), and direct optimization of deep information bottleneck objectives, all in high-dimensional settings.
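A minimal sketch of the MINE training loop in PyTorch makes the estimator concrete; the critic architecture, EMA decay, and the names `Critic` and `mine_step` are illustrative choices, not the reference implementation of Belghazi et al. (2018).

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Illustrative MLP critic T_theta(x, z); any architecture works."""
    def __init__(self, dim_x, dim_z, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mine_step(critic, opt, x, z, ema, decay=0.99):
    """One ascent step on the DV bound for a batch (x, z) ~ P_XZ."""
    z_perm = z[torch.randperm(z.shape[0])]      # shuffled z approximates P_X (x) P_Z
    t_joint = critic(x, z).mean()               # E_{P_XZ}[T]
    e_marg = critic(x, z_perm).exp().mean()     # E_{P_X (x) P_Z}[e^T]
    ema = decay * ema + (1 - decay) * e_marg.detach()
    # Surrogate loss: the gradient of e_marg / ema approximates the gradient of
    # log E[e^T] with an EMA-smoothed (less biased) denominator.
    loss = -(t_joint - e_marg / ema)
    opt.zero_grad()
    loss.backward()
    opt.step()
    dv_bound = (t_joint - e_marg.log()).item()  # current MI lower-bound estimate
    return dv_bound, ema
```

Training iterates `mine_step` over minibatches, with `ema` initialized to, say, `torch.tensor(1.0)`; the running DV bound is the MI estimate.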

2. High-Dimensional and Structured Nonparametric Estimation

Classical nonparametric MI estimators (e.g., kNN-based KSG) require sample size exponential in the true MI or dimension due to their reliance on local uniformity assumptions (Gao et al., 2014). This becomes prohibitive as $I(X;Y)$ or $d$ increases. Corrections such as the Local Nonuniformity Correction (LNC) relax the uniformity assumption, employing local PCA to adaptively estimate density volumes within kNN neighborhoods. The LNC-adjusted MI estimator exhibits superior scaling with respect to both MI and dimension, delivering accurate MI estimates for strongly dependent or near-deterministic variables with orders-of-magnitude fewer samples than standard kNN-type methods.
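For reference, the uncorrected KSG baseline that LNC builds on is short enough to sketch; `ksg_mi` below follows Kraskov et al.'s first algorithm with max-norm neighborhoods, and the LNC volume correction itself is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """Baseline KSG estimator (Kraskov et al., algorithm 1).
    x: (n, dx) array, y: (n, dy) array; returns MI in nats."""
    n = len(x)
    xy = np.hstack([x, y])
    # Distance to the k-th neighbour in joint space (max-norm; self excluded).
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    # Count marginal neighbours strictly inside each joint-space radius.
    nx = np.array([len(tree_x.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(tree_y.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

On correlated Gaussian pairs the estimate can be checked against the closed form $-\tfrac{1}{2}\log(1-\rho^2)$; the exponential sample requirement discussed above shows up as severe underestimation when $\rho \to 1$.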

For better robustness across a range of data-generating processes, semiparametric models such as the nonparanormal (Gaussian copula) model enable closed-form MI estimation via the log-determinant of latent correlation matrices. Spearman- or Kendall-based estimators construct robust, consistent, and computationally tractable plug-in MI estimators with $O(D^2 n + D^3)$ complexity for $n$ samples in $D$ dimensions, empirically practical up to several thousand dimensions and immune to non-Gaussian marginal effects (Singh et al., 2017).
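A compact sketch of the rank-based plug-in follows, assuming the standard Kendall-tau bridge $\rho = \sin(\pi\tau/2)$ for the latent correlations; the regularization of near-singular correlation matrices that a practical implementation needs is left out.

```python
import numpy as np
from scipy.stats import kendalltau

def nonparanormal_mi(x, y):
    """Plug-in MI (nats) under a Gaussian copula model, as in the rank-based
    approach of Singh et al. (2017). x: (n, dx), y: (n, dy)."""
    z = np.hstack([x, y])
    d = z.shape[1]
    r = np.eye(d)
    # Latent correlation matrix from pairwise Kendall's tau: O(D^2) pairs.
    for i in range(d):
        for j in range(i + 1, d):
            tau = kendalltau(z[:, i], z[:, j])[0]
            r[i, j] = r[j, i] = np.sin(np.pi * tau / 2.0)
    dx = x.shape[1]
    # MI of a Gaussian vector pair: half the log-det ratio of block marginals.
    _, logdet = np.linalg.slogdet(r)
    _, logdet_x = np.linalg.slogdet(r[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(r[dx:, dx:])
    return 0.5 * (logdet_x + logdet_y - logdet)
```

The tau computations dominate at $O(D^2 n \log n)$, and the log-determinants cost $O(D^3)$, matching the complexity quoted above.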

In the presence of low-dimensional manifold structure, geodesic kNN estimators replace ambient Euclidean balls with data-manifold geodesic neighborhoods. The G-KSG estimator inherits theoretical consistency (vanishing bias when the data lie on a low-dimensional manifold) and remains accurate where ambient-dimension-based estimators fail (Marx et al., 2021).

3. Hashing, Graph, and Flow-Based Linear-Complexity Estimators

A major challenge in high-throughput or streaming environments is achieving both computational and statistical efficiency. The EDGE estimator (Ensemble Dependency Graph Estimator) combines locality-sensitive hashing, dependency graphs, and ensemble bias correction to obtain linear-time, $O(n)$, nonparametric MI estimation for $n$ samples, with a provable parametric $O(1/n)$ mean-squared error rate under differentiability of the density (Noshad et al., 2018). The ensemble method averages MI estimates over multiple random hashing resolutions, constructing bias-canceling weights that deliver optimal rates even in moderately high-dimensional scenarios.
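At a single resolution, the hashing idea reduces to bucket-count plug-in estimation, as in the hedged sketch below; the actual EDGE estimator averages several hash resolutions with bias-cancelling ensemble weights, which this sketch omits.

```python
import numpy as np
from collections import Counter

def hash_mi(x, y, eps=0.5):
    """Single-resolution sketch of the hashing idea behind EDGE.
    Points are bucketed by a floor-based hash of width eps; MI is the plug-in
    estimate from bucket co-occurrence counts. Runs in O(n) for n samples."""
    hx = [tuple(v) for v in np.floor(x / eps).astype(int)]
    hy = [tuple(v) for v in np.floor(y / eps).astype(int)]
    n = len(hx)
    cx, cy, cxy = Counter(hx), Counter(hy), Counter(zip(hx, hy))
    mi = 0.0
    for (bx, by), nxy in cxy.items():
        mi += (nxy / n) * np.log(nxy * n / (cx[bx] * cy[by]))
    return mi  # nats; biased at any fixed eps, hence EDGE's ensemble weighting
```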

Flow matching MI estimation (FMMI) reinterprets the discriminative paradigm as fitting a continuous-time normalizing flow that couples the product-of-marginals to the joint distribution. Because the expected divergence of the learned flow equals the KL-divergence, FMMI yields a scalable and precise neural MI estimator even when MI is large and dimension is high. Empirical evaluations demonstrate superior performance—robust to both high ambient dimension and high MI—compared to discriminative or classifier-based methods (Butakov et al., 11 Nov 2025).

4. Sliced and Max-Sliced Mutual Information for Statistical Scalability

Sliced Mutual Information (SMI) and its $k$-dimensional extension ($k$-SMI) address the curse of dimensionality by averaging the MI over random 1D or $k$-dimensional linear projections of the variables. For $k=1$,

$$\mathrm{SMI}(X;Y) = \int_{\mathbb{S}^{d_x-1}} \int_{\mathbb{S}^{d_y-1}} I\big(\theta^\top X;\, \phi^\top Y\big)\, d\sigma_{d_x}(\theta)\, d\sigma_{d_y}(\phi),$$

where $\sigma_d$ denotes the uniform measure on the unit sphere. For $k$-SMI, estimation reduces to a sequence of low-dimensional MI computations followed by Monte Carlo averaging over projections. Theoretical error bounds reveal that the variance decreases with the number $m$ of random projections, and, surprisingly, the estimation error can decrease as the ambient dimension grows while $k$ stays fixed, giving a "blessing of dimensionality" effect (Goldfeld et al., 2022, Goldfeld et al., 2021). These estimators are particularly effective for independence testing, feature screening, and as scalable surrogates in deep generative modeling (e.g., InfoGAN variants).
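A Monte Carlo sketch of the $k=1$ case, reusing any scalar MI routine (for instance the `ksg_mi` sketch above) as the per-slice estimator:

```python
import numpy as np

def sliced_mi(x, y, mi_1d, m=128, rng=None):
    """Monte Carlo SMI: average a scalar MI estimator over m random 1-D slices.
    mi_1d is any per-slice estimator taking two (n, 1) arrays."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(m):
        theta = rng.standard_normal(x.shape[1]); theta /= np.linalg.norm(theta)
        phi = rng.standard_normal(y.shape[1]); phi /= np.linalg.norm(phi)
        total += mi_1d((x @ theta)[:, None], (y @ phi)[:, None])
    return total / m  # Monte Carlo error over slices shrinks like O(1/sqrt(m))
```

For example, `sliced_mi(x, y, mi_1d=ksg_mi, m=64)` averages KSG estimates over 64 random slices; the per-slice problems are one-dimensional regardless of the ambient dimensions of `x` and `y`.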

Max-sliced mutual information (mSMI) sharpens the approach by maximizing the MI over all $k$-dimensional subspace projections (recovering CCA in the Gaussian case). The resulting neural estimators jointly optimize projection directions and a variational critic, achieving parametric $O(n^{-1/2})$ estimation rates without the need for multiple critics over slices. mSMI methods outperform both CCA and average-slice approaches in independence testing, multi-view learning, and algorithmic fairness while maintaining computational scalability (Tsur et al., 2023).
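Under the same assumptions, a $k=1$ max-sliced variant can reuse the MINE-style `Critic` and `mine_step` from Section 1 by making the projection vectors trainable; this is an illustrative reduction, not the estimator of Tsur et al. (2023).

```python
import torch
import torch.nn as nn

# Hypothetical k=1 max-sliced setup: theta and phi are trainable directions
# optimised jointly with the DV critic via the mine_step sketch above.
dim_x, dim_z = 32, 32
critic = Critic(1, 1)
theta = nn.Parameter(torch.randn(dim_x))
phi = nn.Parameter(torch.randn(dim_z))
opt = torch.optim.Adam(list(critic.parameters()) + [theta, phi], lr=1e-4)

def msmi_step(x, z, ema):
    tx = (x @ (theta / theta.norm()))[:, None]  # 1-D projection of x
    tz = (z @ (phi / phi.norm()))[:, None]      # 1-D projection of z
    return mine_step(critic, opt, tx, tz, ema)  # ascend bound and directions jointly
```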

5. Neural Network and Diffusion-Based Estimators for Unified and Real-Time Scalability

Fully supervised neural estimators such as MIST (Mutual Information via Supervised Training) take an empirical approach, meta-learning the MI functional from large synthetic datasets labeled with ground-truth MI. MIST uses a two-stage permutation-invariant attention transformer, handling variable sample sizes and dimensions, and is trained to predict either point or quantile (uncertainty-aware) MI estimates. Inference takes milliseconds per dataset, scales robustly across sample sizes and dimensions, and generalizes out-of-distribution to novel dependency structures (Gritsai et al., 24 Nov 2025).

InfoNet directly maps batches of (preprocessed) sample pairs to an estimated MI through a feedforward attention mechanism. By learning to output a discretized optimal DV critic across synthetic distribution families, InfoNet achieves real-time MI estimation with accuracy on par or superior to classical methods, eliminating the need for test-time neural optimization (Hu et al., 2024).

Diffusion-based estimators, including MMG, exploit the relationship between mutual information and the integrated gap between the unconditional and conditional MMSE along a Gaussian noise path. By the I-MMSE relation, with $X_\gamma = \sqrt{\gamma}\,X + N$ and $N \sim \mathcal{N}(0, I)$:

$$I(X;Y) = \frac{1}{2} \int_0^\infty \Big[ \mathrm{mmse}\big(X \mid X_\gamma\big) - \mathrm{mmse}\big(X \mid X_\gamma, Y\big) \Big]\, d\gamma.$$

By learning denoisers for both curves over a range of SNRs and using adaptive importance sampling, MMG achieves robust, bias-variance-controlled MI estimation in both low- and high-MI regimes, passing stringent self-consistency tests and outperforming diffusion-score based competitors (Yu et al., 24 Sep 2025).
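The identity can be sanity-checked in closed form for jointly Gaussian pairs, where both MMSE curves are analytic; the grid and truncation level of the SNR integral below are arbitrary illustrative choices.

```python
import numpy as np

# For unit-variance jointly Gaussian (X, Y) with correlation rho:
#   mmse(X | sqrt(g) X + N)    = 1 / (1 + g)
#   mmse(X | sqrt(g) X + N, Y) = (1 - rho^2) / (1 + g (1 - rho^2)),
# since X | Y is Gaussian with variance 1 - rho^2.
rho = 0.8
g = np.linspace(0.0, 1e4, 2_000_001)  # SNR grid, truncated at 1e4
gap = 1.0 / (1.0 + g) - (1.0 - rho**2) / (1.0 + g * (1.0 - rho**2))
mi_mmse = 0.5 * np.sum((gap[:-1] + gap[1:]) / 2 * np.diff(g))  # trapezoid rule
mi_exact = -0.5 * np.log(1.0 - rho**2)
print(mi_mmse, mi_exact)  # both close to 0.511 nats (small truncation error)
```

MMG replaces the two analytic curves with learned denoisers and replaces the fixed grid with adaptive importance sampling over SNR.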

6. Advances in Stability, Structure, and Domain-Specific Scalability

For kNN-type MI estimators, high dimensionality leads to floating-point overflow in calculations involving exponentiation of distances (e.g., for normalized MI). Logarithmic refactorization of the normalization step allows stable and precise MI estimation even at very high dimensionality, without loss of fidelity or variance inflation, extending the reach of kNN-based methods to genomics, neuroimaging, and other "wide" data contexts (Tuononen et al., 2024).
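The failure mode and its fix are easy to demonstrate on the log-volume of a $d$-ball, a quantity such estimators would otherwise exponentiate; a small sketch, with the dimension chosen only for illustration:

```python
import numpy as np
from scipy.special import gammaln

def log_ball_volume(d, r):
    """log of the d-dimensional ball volume pi^(d/2) r^d / Gamma(d/2 + 1),
    evaluated entirely in the log domain so no intermediate term overflows."""
    return (d / 2.0) * np.log(np.pi) + d * np.log(r) - gammaln(d / 2.0 + 1.0)

d, r = np.float64(10_000), np.float64(1.5)
naive = np.pi ** (d / 2) * r ** d / np.exp(gammaln(d / 2 + 1))
print(naive)                  # nan: inf * inf / inf, intermediates overflow float64
print(log_ball_volume(d, r))  # finite log-volume, usable directly in MI formulas
```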

Domain-specific scalability is further illustrated in quantum many-body systems, where scalable Monte Carlo methods for Rényi mutual information leverage SSE simulations with replica-stitched boundary conditions, enabling finite-size scaling analyses near criticality and across various lattice geometries (Melko et al., 2010). In large binary datasets, vectorized matrix-based methods compute all-pairs MI up to 50,000× faster than naïve pairwise approaches by leveraging optimized linear algebra operations (Falcao, 2024).
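A minimal numpy version of the matrix trick: the four 2×2 contingency counts for every variable pair come from four matrix products, after which the MI formula is applied elementwise. Names and layout are illustrative, not Falcao's API.

```python
import numpy as np

def all_pairs_binary_mi(b):
    """All-pairs MI (nats) for a binary matrix b of shape (n_samples, n_vars).
    Each 2x2 joint table is built with one matrix product per cell, so the
    whole computation runs on BLAS rather than a Python double loop."""
    b = b.astype(np.float64)
    n, _ = b.shape
    p1 = b.mean(axis=0)                                # P(var = 1) per variable
    cells = [((1 - b).T @ (1 - b), 1 - p1, 1 - p1),    # counts of (0, 0)
             ((1 - b).T @ b,       1 - p1, p1),        # counts of (0, 1)
             (b.T @ (1 - b),       p1,     1 - p1),    # counts of (1, 0)
             (b.T @ b,             p1,     p1)]        # counts of (1, 1)
    mi = np.zeros_like(cells[0][0])
    with np.errstate(divide="ignore", invalid="ignore"):
        for counts, ma, mb in cells:
            pj = counts / n                            # joint cell probability
            term = pj * np.log(pj / np.outer(ma, mb))  # p log(p / (p_a p_b))
            mi += np.nan_to_num(term)                  # treat 0 log 0 as 0
    return mi
```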

7. Summary Table of Principal Scalable Estimation Methods

Method/Class             | Core Principle                        | Scalability Driver
MINE                     | DV variational neural estimator       | Linear in dimension/sample; backprop
kNN/LNC/geodesic         | Local density/geometry, PCA/manifold  | Adaptive, robust to concentration
Nonparanormal/rank-based | Copula, rank correlations             | Immunity to marginal mis-shaping
EDGE                     | LSH hashing, ensemble graphs          | Linear-time, bias-corrected
FMMI                     | Flow-matching normalizing flows       | Reduces complexity at high MI
Sliced/$k$-SMI, mSMI     | Projections/slices, maximization      | Reduces estimation to low dimension
MIST, InfoNet            | Meta-learned deep network mappings    | $O(1)$ feedforward per dataset
MMG (diffusion)          | Integral of MMSE gap                  | Parallel/importance sampling
Binary matrix method     | Vectorized pairwise MI (discrete)     | BLAS/tensor engine acceleration

Scalable mutual information estimation is a rapidly evolving field, driven by advances in neural estimation, nonparametric correction, dimension reduction, and efficient scientific computation. Modern approaches carefully calibrate bias, variance, and computational complexity, enabling accurate MI inference in domains previously inaccessible due to dimensionality, sample size, or dependency strength constraints (Belghazi et al., 2018, Butakov et al., 11 Nov 2025, Gritsai et al., 24 Nov 2025, Tuononen et al., 2024, Singh et al., 2017, Goldfeld et al., 2022, Goldfeld et al., 2021, Falcao, 2024, Yu et al., 24 Sep 2025).
