Mutual Information-Based Criterion
- Mutual Information-Based Criterion is a quantitative measure that assesses dependency between random variables, guiding feature selection and dimensionality reduction.
- It leverages invariance and self-consistency properties to enhance robust estimation and algorithmic performance in statistical and signal processing applications.
- The criterion underpins practical applications such as stopping rules in decoding, neural representation learning, and secure coding via explicit MI constraints.
A mutual information-based criterion is any principle, objective function, or selection rule that uses mutual information (MI), a scalar statistic quantifying the statistical dependence between random variables, as its core quantitative assessment. MI-based criteria appear throughout classical and contemporary statistics, signal processing, learning, and information-theoretic systems, where they provide an operational, tractable formulation for quantifying association, relevance, or uncertainty reduction across diverse structured settings.
1. Mathematical Foundations of Mutual Information Criteria
Mutual information between random variables $X$ and $Y$ is defined as

$$I(X;Y) \;=\; \sum_{x,y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}$$

(with the sum replaced by an integral for continuous variables) and quantifies the reduction in uncertainty about one variable given knowledge of the other, $I(X;Y) = H(X) - H(X \mid Y)$. It is symmetric ($I(X;Y) = I(Y;X)$), non-negative, and equals zero if and only if $X$ and $Y$ are independent.
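The definition can be evaluated directly for small discrete distributions. Below is a minimal plug-in sketch (the function name and toy pmfs are ours, not from any cited work):

```python
import numpy as np

def mutual_information(p_xy):
    """Plug-in mutual information (in nats) of a discrete joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)          # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = p_xy > 0                                # 0 * log 0 = 0 convention
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Independence gives MI = 0; a perfectly coupled pair gives MI = ln 2.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # ~0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))       # ~0.693
```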
Formal MI criteria are used to:
- Maximize relevance between representations and targets, e.g., in dimension reduction (Razeghi et al., 2019).
- Minimize redundancy or enforce conditional independence through chain-rule or higher-order MI (Vergara et al., 2015, Venkateswara et al., 2017).
- Serve as stopping rules or thresholds: stopping iterative decoding when estimated MI exceeds a threshold linked to BER (Wu et al., 2013), or terminating feature addition when residual CMI is small (Yu et al., 2018).
- Define security in distributed cryptosystems via leakage constraints (Oohama et al., 17 Jul 2025).
These criteria map directly to operational or information-theoretic guarantees: minimal Bayes error (via Fano's inequality), security leakage, statistical significance, or task-aligned representational fidelity.
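For instance, Fano's inequality converts an MI value into a lower bound on the Bayes error of any predictor $\hat{X} = g(Y)$ of a discrete variable $X$ with error probability $P_e = \Pr[\hat{X} \neq X]$ (entropies in bits):

$$H(X \mid Y) \;\le\; h_b(P_e) + P_e \log_2\bigl(|\mathcal{X}| - 1\bigr),$$

which, combined with $H(X \mid Y) = H(X) - I(X;Y)$ and $h_b(P_e) \le 1$, yields

$$P_e \;\ge\; \frac{H(X) - I(X;Y) - 1}{\log_2 |\mathcal{X}|}.$$

Criteria that increase $I(X;Y)$ therefore directly lower the achievable error floor.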
2. Equitability, Invariance, and Self-Consistency
A central property of MI-based criteria is self-equitability, tightly linked to the data processing inequality (DPI): for any deterministic function $f$ such that $X \leftrightarrow f(X) \leftrightarrow Y$ forms a Markov chain,

$$I(X;Y) \;=\; I\bigl(f(X);Y\bigr),$$

meaning that MI is invariant under reduction to sufficient statistics and, in particular, under all invertible transformations of either variable (Kinney et al., 2013).
Unlike alternative measures (e.g., the maximal information coefficient, MIC), MI satisfies:
- Full invariance under arbitrary invertible transforms,
- The DPI (monotonicity under post-processing of variables),
- Criterion independence from the particular parametric form of association, guaranteeing conceptual naturality and generality (Kinney et al., 2013).
The MI criterion thus quantifies all statistical dependence and equates relationships of equal noisiness, regardless of functional form, in contrast with grid-based or monotonic-invariant methods.
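These invariance and DPI properties are easy to verify numerically in the discrete case. The sketch below re-implements the plug-in MI for self-containment (the pmfs and names are illustrative):

```python
import numpy as np

def mi(p_xy):
    """Plug-in MI (nats) for a discrete joint pmf."""
    p_xy = np.asarray(p_xy, float)
    px = p_xy.sum(1, keepdims=True)
    py = p_xy.sum(0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log(p_xy[m] / (px @ py)[m])).sum())

rng = np.random.default_rng(0)
p = rng.random((4, 3)); p /= p.sum()           # arbitrary joint pmf, X in {0..3}, Y in {0..2}

perm = rng.permutation(4)                      # invertible relabeling f of X
print(np.isclose(mi(p[perm]), mi(p)))          # True: I(f(X);Y) = I(X;Y)

coarse = np.vstack([p[0] + p[1], p[2], p[3]])  # non-invertible f merging two X symbols
print(mi(coarse) <= mi(p) + 1e-12)             # True: data processing inequality
```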
3. Algorithmic Realizations across Research Domains
Mutual information-based criteria are implemented in numerous algorithmic frameworks across disciplines:
A. Feature Selection
- Maximal relevance: selecting features $X_i$ with maximal $I(X_i; Y)$ with respect to the target $Y$ (Schnapp et al., 2020, Vergara et al., 2015).
- mRMR (minimum-redundancy-maximum-relevance): adding the feature $X_i$ that maximizes $I(X_i;Y) - \frac{1}{|S|}\sum_{X_j \in S} I(X_i;X_j)$ given the already selected set $S$ (Liu et al., 2022, Vergara et al., 2015); a greedy sketch follows this list.
- Unique relevance (BUR): augmenting relevance with each feature's unique MI contribution, i.e., information about $Y$ that $X_i$ carries but the remaining candidate features do not, for redundancy control (Liu et al., 2022).
- Global subset selection (BQP): expressing the subset relevance as a binary quadratic program under conditional-independence assumptions and solving it with approximations such as TPower and LowRank (Venkateswara et al., 2017).
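A minimal greedy sketch of the mRMR rule, using scikit-learn's KNN-based MI estimators as the relevance and redundancy scores (the data and function names are illustrative, not from the cited papers):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedy mRMR: repeatedly add argmax_i [ I(X_i;Y) - mean_{j in S} I(X_i;X_j) ]."""
    relevance = mutual_info_classif(X, y, random_state=random_state)   # I(X_i; Y)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        def score(i):
            if not selected:
                return relevance[i]
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, i], random_state=random_state)[0]
                for j in selected
            ])
            return relevance[i] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: y depends on features 0 and 1; feature 2 duplicates feature 0; 3-4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=500)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(mrmr_select(X, y, k=2))   # typically picks the two informative, non-redundant features
```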
B. Dimensionality Reduction and Subspace Learning
- Greedy subspace selection: rank candidate directions $w$ by an estimate of $I(w^{\top}X;\,Y)$, retain the maximizers, and construct projections that preserve discriminative information under MI (Razeghi et al., 2019, Ozdenizci et al., 2021); a toy ranking sketch follows this list.
- Stochastic MI-gradient neural dimensionality reduction (MMINet): learn nonlinear mappings $f$ that maximize $I(f(X);Y)$ end-to-end via stochastic MI gradients, without distributional assumptions (Ozdenizci et al., 2021).
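A toy sketch of direction ranking by estimated MI with the labels, using scikit-learn's KNN-based estimator (the candidate directions and data are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_directions_by_mi(X, y, candidates, random_state=0):
    """Score each candidate direction w by an estimate of I(w^T X; Y), highest first."""
    scores = np.array([
        mutual_info_classif((X @ w).reshape(-1, 1), y, random_state=random_state)[0]
        for w in candidates
    ])
    order = np.argsort(scores)[::-1]
    return order, scores[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)                  # only the first coordinate is informative
candidates = [np.eye(3)[i] for i in range(3)]  # coordinate axes as candidate directions
order, scores = rank_directions_by_mi(X, y, candidates)
print(order)                                   # axis 0 should rank first
```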
C. Encoding, Decoding, and Representation Learning
- Iterative decoder stopping: monitor the MI between the transmitted bits and the decoder's LLR output, and terminate decoding once it crosses a direct threshold (Wu et al., 2013); a stopping-rule sketch follows this list.
- Self-supervised learning (SSL): maximize the MI between learned representations, with the loss reducing to log-determinant forms under a distributional homeomorphism assumption for efficient SSL objectives (Chang et al., 7 Sep 2024).
- Communication and security: Define reliability or secrecy as explicit MI constraints, e.g., in distributed encryption (Oohama et al., 17 Jul 2025).
- Neural decoders via discriminative MI objectives: train discriminators that approximate the joint-to-marginal density ratio, yielding MI estimates whose maximization supports robust MAP decoding (Tonello et al., 2022).
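One common way to turn LLRs into an MI estimate for a stopping rule is the time-average approximation used in EXIT-chart analysis, $I \approx 1 - \tfrac{1}{N}\sum_n \log_2(1 + e^{-x_n L_n})$ with $x_n \in \{\pm 1\}$, which assumes consistent (symmetric) LLRs. The sketch below is illustrative; the threshold and function names are ours, and in a real decoder the unknown bits are typically replaced by hard decisions $\operatorname{sign}(L_n)$:

```python
import numpy as np

def llr_mutual_information(bits, llrs):
    """Estimate I(bit; LLR) in bits via 1 - mean(log2(1 + exp(-x * L))), x in {-1, +1}."""
    x = 2.0 * np.asarray(bits) - 1.0             # map {0, 1} -> {-1, +1}
    return 1.0 - np.mean(np.log2(1.0 + np.exp(-x * np.asarray(llrs, dtype=float))))

def should_stop(bits, llrs, threshold=0.999):
    """Hypothetical stopping rule: terminate iterations once the MI estimate crosses a threshold."""
    return llr_mutual_information(bits, llrs) >= threshold

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=1000)
good_llrs = (2 * bits - 1) * 10.0 + rng.normal(scale=0.5, size=1000)
print(llr_mutual_information(bits, good_llrs))              # close to 1: stop
print(llr_mutual_information(bits, rng.normal(size=1000)))  # near 0: keep iterating
```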
D. Clustering, Hashing, and Compression
- Cluster evaluation by average normalized MI (ANMI) against attribute-based reference partitions [0511013]; a short ANMI sketch follows this list.
- Online hashing: Drive updates and function learning by MI between Hamming distances and neighborhood indicators (Cakir et al., 2017).
- Layerwise neural network pruning: Compute conditional geometric MI between filters for dependency-aware compression (Ganesh et al., 2020).
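A minimal sketch of ANMI-style evaluation, assuming scikit-learn's normalized MI as the pairwise score (the reference partitions and clustering below are toy placeholders):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical reference partitions (e.g., one per categorical attribute) and a candidate clustering.
references = [
    np.array([0, 0, 1, 1, 2, 2]),
    np.array([0, 1, 0, 1, 0, 1]),
]
clustering = np.array([0, 0, 1, 1, 2, 2])

# Average normalized MI of the clustering against all reference partitions.
anmi = np.mean([normalized_mutual_info_score(ref, clustering) for ref in references])
print(anmi)
```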
4. Estimation Methodologies and Practical Challenges
Estimation of MI in high-dimensional, continuous, or complex discrete domains remains a central theme:
- Histogram-based estimators: Tractable but limited by the curse of dimensionality and sensitive to binning (Papana et al., 2009).
- k-nearest neighbor (KNN, Kraskov) estimators: consistent and stable for low to moderate dimensions; the neighbor count $k$ controls the smoothness/bias trade-off (Papana et al., 2009, Kinney et al., 2013); see the comparison sketch below.
- Kernel and graph-based estimators: Kernel density for continuous variables; geometric estimators for structural dependencies (Papana et al., 2009, Ganesh et al., 2020).
- Matrix-based Renyi entropy estimators: Direct RKHS functionals for joint and conditional MI without PDF estimation, scalable for stopping criteria (Yu et al., 2018).
- Neural/variational estimators: Learnable discriminators for MI or density ratio approximations, especially under unknown or implicit distributions (Tonello et al., 2022, Chang et al., 7 Sep 2024).
Bias-variance tradeoffs, computational costs (bins, neighborhoods, spectral decompositions), convergence under finite samples, and robustness to distributional shifts are recurring concerns (Papana et al., 2009, Yu et al., 2018).
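The following sketch contrasts a binned plug-in estimate with a Kraskov-style KNN estimate on a toy Gaussian pair whose true MI is about $0.80$ nats (the bin counts and neighbor counts are illustrative):

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)      # true MI = 0.5 * ln(5) ~ 0.80 nats

# Histogram (plug-in) estimate: sensitive to the number of bins.
for bins in (5, 20, 100):
    cx = np.digitize(x, np.histogram_bin_edges(x, bins))
    cy = np.digitize(y, np.histogram_bin_edges(y, bins))
    print("bins", bins, mutual_info_score(cx, cy))

# Kraskov-style KNN estimate: the neighbor count k trades bias against variance.
for k in (3, 10, 50):
    print("k", k, mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=k, random_state=0)[0])
```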
5. Extensions and Alternative Dependence Criteria
While canonical MI is defined via Kullback–Leibler divergence, several generalizations accommodate continuous, heavy-tailed, or privacy-sensitive regimes:
| Criterion | Divergence/Metric | Key Properties |
|---|---|---|
| $I$ (classical MI) | Kullback–Leibler | Can be unbounded/infinite; sensitive to support mismatches (Kuskonmaz et al., 2022) |
| $I_{\mathrm{JS}}$ (Jensen–Shannon) | Jensen–Shannon | Symmetric, bounded in $[0, \log 2]$; its square root is a metric (Kuskonmaz et al., 2022) |
| $I_{\mathrm{TV}}$ (Total Variation) | Total variation distance | True metric; coarse for small differences (Kuskonmaz et al., 2022) |
| $I_{\mathrm{W}}$ (Wasserstein MI) | Wasserstein | Geometric, robust; higher computational cost (Kuskonmaz et al., 2022) |
These alternatives are used when classical MI is ill-posed or numerically unstable, and admit plug-in, kNN, kernel, or Sinkhorn estimators depending on sample size and dimension (Kuskonmaz et al., 2022).
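A minimal sketch of a JS-based dependence measure for discrete variables, assuming it is defined as the Jensen–Shannon divergence between the joint pmf and the product of its marginals (the function names are ours):

```python
import numpy as np

def kl(p, q):
    m = p > 0
    return float((p[m] * np.log(p[m] / q[m])).sum())

def js_mi(p_xy):
    """JS analogue of MI: JS divergence between the joint pmf and the product of marginals."""
    p_xy = np.asarray(p_xy, float)
    prod = p_xy.sum(1, keepdims=True) @ p_xy.sum(0, keepdims=True)
    mix = 0.5 * (p_xy + prod)
    return 0.5 * kl(p_xy, mix) + 0.5 * kl(prod, mix)

print(js_mi([[0.5, 0.0], [0.0, 0.5]]))      # perfectly dependent pair, bounded by log 2
print(js_mi(np.full((2, 2), 0.25)))         # 0.0 for independent variables
```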
6. Theoretical Impact and Open Issues
MI-based criteria have foundational significance:
- They are uniquely self-equitable among plausible dependence measures (Kinney et al., 2013).
- They yield nontrivial Bayes-error bounds (Fano, Pinsker) (Razeghi et al., 2019, Vergara et al., 2015).
- In feature selection, sufficiency and necessity of MI criteria relate to Markov blanket and unique relevance properties, but higher-order synergy, computational tractability for subset selection, and estimator efficiency remain open challenges (Liu et al., 2022, Vergara et al., 2015, Venkateswara et al., 2017).
- Extensions to causal discovery and multi-view learning exploit MI's invariance and DPI-based self-consistency.
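As a concrete statement of the sufficiency condition referenced above, a feature subset $S$ acts as a Markov blanket for the target $Y$ exactly when the residual conditional MI vanishes:

$$I\bigl(Y;\,X_{\bar S}\,\big|\,X_S\bigr) = 0 \quad\Longleftrightarrow\quad Y \perp X_{\bar S} \mid X_S,$$

where $X_{\bar S}$ denotes the unselected features; CMI-based stopping rules terminate selection once an estimate of this quantity falls below a small threshold (Yu et al., 2018).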
7. Empirical Performance and Synthesis
Across domains, MI-based criteria achieve:
- Improved accuracy and dimensional efficiency in feature selection, when augmented with unique relevance terms and robust estimators (Liu et al., 2022, Razeghi et al., 2019).
- State-of-the-art results in self-supervised learning and online hashing by directly optimizing MI-based objectives (Chang et al., 7 Sep 2024, Cakir et al., 2017).
- Robust and computationally efficient stopping rules for iterative systems (Wu et al., 2013, Yu et al., 2018).
- Principled design of secure coding and beamforming systems via explicit MI constraints (Oohama et al., 17 Jul 2025, Li et al., 2022).
Their operational success is grounded in rigorous invariance, self-consistency, and direct empirical links to task objectives.
References:
- Equitability, invariance, and failure modes of MIC: (Kinney et al., 2013)
- Stopping rules for iterative decoding: (Wu et al., 2013), feature selection: (Yu et al., 2018)
- Explicit MI maximization in SSL: (Chang et al., 7 Sep 2024)
- Compression via geometric conditional MI: (Ganesh et al., 2020)
- MI-based discriminative subspaces: (Razeghi et al., 2019)
- Global feature selection as BQP: (Venkateswara et al., 2017)
- Feature selection frameworks and taxonomy: (Vergara et al., 2015)
- Practical estimation, alternative metrics: (Kuskonmaz et al., 2022, Papana et al., 2009)
- Maximal MI in concept discovery: (Zhou et al., 21 Jul 2024)
- Unique relevance augmentation: (Liu et al., 2022)
- MI-based clustering: [0511013]
- MI security in source encryption: (Oohama et al., 17 Jul 2025)
- MI-based neural decoding: (Tonello et al., 2022)
- MIHash: (Cakir et al., 2017)
- Bandit-based active MI selection: (Schnapp et al., 2020)
- MI-gradient deep DR: (Ozdenizci et al., 2021)