Mutual Information Gap in Representation Learning
- MIG is a metric that quantifies the gap between the highest and second-highest mutual information values, assessing latent representation disentanglement.
- In video prediction and language modeling, MIG guides training by reducing information spillover and boosting factor-specific performance.
- Extensions like DMIG adjust for correlated attributes, improving measurement accuracy in settings where generative factors are interdependent.
The Mutual Information Gap (MIG) is a quantitative metric for assessing the disjointness of information captured by latent representations in machine learning models, particularly in the evaluation of disentanglement and in optimizing training targets for LLMs. MIG measures, in various domains, the extent to which a model’s subcomponents or latent codes specialize in capturing specific generative factors or target tokens, relative to possible information spillover into non-matching components. MIG variants are applied across supervised disentanglement, unsupervised video prediction, and LLM training, serving to both diagnose representational quality and inform information-optimal training strategies (Yang et al., 31 Oct 2025, Sreekar et al., 2020, Watcharasupat et al., 2021).
1. Formal Definitions and Mathematical Formulation
MIG quantifies the difference (the "gap") between the mutual information (MI) a factor shares with its most informative latent assignment and the MI it shares with the second-most informative assignment, normalized to facilitate cross-factor comparability. In the canonical disentanglement literature, the MIG associated with attribute $v_k$ and latent code blocks $\{z_j\}$ is

$$\mathrm{MIG}_k = \frac{1}{H(v_k)}\Big[\, I\big(z_{j_k}; v_k\big) - I\big(z_{\tilde{j}_k}; v_k\big) \,\Big],$$

where $I(\cdot\,;\cdot)$ is MI, $H(\cdot)$ is entropy, $j_k = \arg\max_j I(z_j; v_k)$ is the index of the most informative latent, and $\tilde{j}_k$ is the index of the second-most informative latent (Watcharasupat et al., 2021); the overall score averages $\mathrm{MIG}_k$ over all attributes.
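As an illustration, the per-attribute gap can be computed directly from a matrix of factor-latent MI estimates. The following minimal sketch (array and function names are illustrative, not taken from the cited works) assumes the MI matrix and factor entropies have already been estimated:

```python
import numpy as np

def mig_per_factor(mi_matrix: np.ndarray, factor_entropies: np.ndarray) -> np.ndarray:
    """Per-factor MIG from an MI matrix.

    mi_matrix[k, j]     : estimated I(z_j; v_k) for factor k and latent j
    factor_entropies[k] : estimated H(v_k)
    """
    sorted_mi = np.sort(mi_matrix, axis=1)[:, ::-1]  # descending MI per factor
    gap = sorted_mi[:, 0] - sorted_mi[:, 1]          # best minus second-best latent
    return gap / factor_entropies                    # normalize by H(v_k)

# Hypothetical values: 2 factors, 3 latent dimensions
mi = np.array([[1.20, 0.10, 0.05],
               [0.20, 0.90, 0.15]])
H = np.array([1.3, 1.0])
print(mig_per_factor(mi, H).mean())  # mean over factors gives the reported MIG
```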
In video disentanglement with content and pose blocks, the metric is adapted as

$$\mathrm{MIG} = \frac{1}{2}\left[\frac{I(z_c; f_c) - I(z_p; f_c)}{H(f_c)} + \frac{I(z_p; f_p) - I(z_c; f_p)}{H(f_p)}\right],$$

where $f_c, f_p$ are discrete content and pose factors, and $z_c, z_p$ are the learned content and pose vectors (Sreekar et al., 2020).
For LLM target token selection, the mutual information gap between two target tokens $y_i$ and $y_j$ is

$$\Delta I(y_i, y_j) = I(X; y_i) - I(X; y_j),$$

and the maximal gap within a candidate set $Y$ is

$$\mathrm{MIG}(Y) = \max_{y \in Y} I(X; y) - \min_{y \in Y} I(X; y),$$

where $X$ is the source or current context (Yang et al., 31 Oct 2025).
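As a toy illustration with hypothetical MI values (not drawn from the cited paper), the maximal gap and the corresponding descending-MI target order reduce to a max-minus-min computation and a sort:

```python
# Hypothetical MI estimates I(X; y) for candidate target tokens y
mi_scores = {"7": 2.1, "2": 1.4, "=": 0.3, "the": 0.05}

mig = max(mi_scores.values()) - min(mi_scores.values())     # maximal gap in the set
order = sorted(mi_scores, key=mi_scores.get, reverse=True)  # descending-MI ordering
print(round(mig, 2), order)
```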
2. MIG in Disentangled Representation Learning
In disentanglement, the goal is for each latent code to exclusively represent a distinct generative attribute. MIG operationalizes this by quantifying the extent to which information about each factor is captured uniquely by one latent, as opposed to being distributed across several. The per-factor MIG evaluates the informativeness gap between the best-matching latent dimension and runners-up, with normalization ensuring comparability across factors with different entropies (Watcharasupat et al., 2021). A high average MIG value is a direct indicator that the model achieves successful, axis-aligned disentanglement.
The video disentanglement literature further refines MIG for block-structured latent spaces. The metric is adapted to two blocks (content and pose) and incorporates both intra-block information and cross-block leakage (Sreekar et al., 2020). Empirically, higher MIG scores are strongly correlated with sharper reconstructions and improved latent factor swaps.
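To make the two-block adaptation concrete, the following minimal sketch evaluates the content/pose form of the metric given four MI estimates and the two factor entropies; the numeric values and names are illustrative only:

```python
def two_block_mig(I_zc_fc, I_zp_fc, I_zp_fp, I_zc_fp, H_fc, H_fp):
    """Content/pose MIG: intra-block information minus cross-block leakage,
    normalized by each factor's entropy and averaged over the two blocks."""
    content_term = (I_zc_fc - I_zp_fc) / H_fc  # content factor captured by content code
    pose_term = (I_zp_fp - I_zc_fp) / H_fp     # pose factor captured by pose code
    return 0.5 * (content_term + pose_term)

# Hypothetical estimates: strong intra-block MI, little cross-block leakage
print(two_block_mig(I_zc_fc=1.10, I_zp_fc=0.05,
                    I_zp_fp=0.80, I_zc_fp=0.02,
                    H_fc=1.2, H_fp=0.9))
```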
3. MIG in Training LLMs
In the context of LLMs, MIG acquires a different operational meaning. Rather than diagnosing latent factorization, it serves as an operational measure guiding the selection and ordering of target tokens during training. Here, MIG is defined by the spread in MI between the source context and candidate target tokens. By preferentially placing the tokens with maximal MI early in the output sequence, one reduces the information difference (the "gap") between the most and least informative remaining tokens, accelerating convergence and improving final accuracy (Yang et al., 31 Oct 2025).
In arithmetic, multi-label classification, and open-vocabulary text generation, reordering target tokens by descending MI yields substantial empirical gains: up to ∼29 percentage points of accuracy on arithmetic, higher ROUGE scores, and lower perplexity on open-vocabulary generation. These effects are strongest when targets exhibit strong conditional dependencies or involve under-represented languages.
4. Estimation Procedures and Practical Implementation
Disentanglement and Video Models
MIG estimation requires evaluation of MI between discrete generative factors and the continuous or discrete latent representations. In practice, MI for discrete–continuous pairs is estimated using the Ross (2014) k-NN estimator, and discrete entropies via the Grassberger estimator (Sreekar et al., 2020). The estimation proceeds by encoding samples, gathering appropriate factor–latent tuples, computing MI for each pairing, and assembling these into the MIG formula.
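A minimal estimation sketch is given below, using scikit-learn's `mutual_info_classif` (a k-NN estimator following Ross, 2014, for continuous features against a discrete target); for brevity, a plug-in entropy stands in for the Grassberger estimator used in the cited work, and the array shapes are assumptions:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.feature_selection import mutual_info_classif

def estimate_mig(latents: np.ndarray, factors: np.ndarray) -> float:
    """MIG from encoded samples.

    latents : (N, D) continuous latent codes
    factors : (N, K) discrete ground-truth factors
    """
    migs = []
    for k in range(factors.shape[1]):
        y = factors[:, k]
        # I(z_j; v_k) for every latent dimension j (k-NN, discrete-continuous)
        mi = mutual_info_classif(latents, y, discrete_features=False, random_state=0)
        mi_sorted = np.sort(mi)[::-1]
        # Plug-in entropy of the discrete factor (Grassberger correction omitted)
        _, counts = np.unique(y, return_counts=True)
        h = entropy(counts / counts.sum())
        migs.append((mi_sorted[0] - mi_sorted[1]) / h)
    return float(np.mean(migs))
```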
Language Modeling
For LLMs, the MI between the current source context and each candidate target token is estimated:
- For finite vocabularies, empirically from co-occurrence frequencies in the training set.
- For open vocabularies, by fitting a Markov (bigram) model over lemmatized tokens (e.g., via logistic regression); the model's joint and marginal probabilities yield an "MI score" per candidate token (Yang et al., 31 Oct 2025).
Training pipelines reorder the target token sequence at each step so that higher-MI tokens come first, reducing the MIG as training proceeds. The cross-entropy loss is unchanged, but it is applied to the MI-permuted sequence.
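One possible sketch of such a pipeline is shown below: per-token MI scores are derived from empirical (context, target) co-occurrence counts, and the target sequence is permuted into descending-MI order before the usual cross-entropy loss is applied. The estimator here is a simplified stand-in, not the exact procedure of Yang et al. (31 Oct 2025):

```python
import math
from collections import Counter

def mi_scores(pairs):
    """Per-token contribution to I(X; Y), estimated from empirical (context, target)
    co-occurrence frequencies, used as an MI score for each candidate target token."""
    joint = Counter(pairs)
    ctx = Counter(x for x, _ in pairs)
    tgt = Counter(y for _, y in pairs)
    n = len(pairs)
    scores = {}
    for y in tgt:
        s = 0.0
        for (x, yy), c in joint.items():
            if yy == y:
                p_xy = c / n
                s += p_xy * math.log(p_xy / ((ctx[x] / n) * (tgt[y] / n)))
        scores[y] = s
    return scores

def reorder_targets(targets, scores):
    """Permute the target token sequence into descending-MI order."""
    return sorted(targets, key=lambda y: -scores.get(y, 0.0))

# Hypothetical usage: (context, target-token) pairs from a training set
corpus = [("12+34=", "4"), ("12+34=", "6"), ("56+11=", "6"), ("56+11=", "7")]
scores = mi_scores(corpus)
print(reorder_targets(["4", "6"], scores))  # cross-entropy is then applied in this order
```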
5. Limitations and Corrections of Standard MIG
A fundamental limitation of standard MIG arises when ground-truth factors or semantic attributes are not statistically independent. MIG was derived under an independence assumption, so when attributes are correlated ($I(v_k; v_l) > 0$ for $k \neq l$), the normalization by the marginal entropy $H(v_k)$ is too loose and MIG systematically underestimates disentanglement. Even in ideal settings where each latent perfectly encodes its intended attribute, nonzero cross-mutual-information persists due to inter-attribute dependencies.
To address this, Watcharasupat & Lerch (Watcharasupat et al., 2021) propose the dependency-aware mutual information gap (DMIG), in which the denominator is replaced with the appropriate conditional entropy: $H(v_k \mid v_l)$, where $v_l$ is the attribute associated with the second-most informative latent for $v_k$. This directly accounts for dependencies and recovers consistent measurement even in correlated settings.
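A minimal sketch of the dependency-aware correction, assuming discrete attribute samples and a precomputed MI matrix as before: the only change relative to standard MIG is that each attribute's denominator becomes a conditional entropy estimated from the joint attribute distribution (the choice of conditioning attribute is passed in explicitly here):

```python
import numpy as np
from collections import Counter
from scipy.stats import entropy

def cond_entropy(a: np.ndarray, b: np.ndarray) -> float:
    """Plug-in estimate of H(a | b) = H(a, b) - H(b) for discrete sample arrays."""
    p_joint = np.array(list(Counter(zip(a.tolist(), b.tolist())).values()), dtype=float)
    p_b = np.array(list(Counter(b.tolist()).values()), dtype=float)
    return entropy(p_joint / p_joint.sum()) - entropy(p_b / p_b.sum())

def dmig_per_factor(mi_matrix: np.ndarray, factors: np.ndarray, partner: list) -> np.ndarray:
    """DMIG-style gaps: as in MIG, but attribute k is normalized by H(v_k | v_l),
    with l = partner[k] the attribute tied to the second-most informative latent."""
    gaps = []
    for k in range(mi_matrix.shape[0]):
        srt = np.sort(mi_matrix[k])[::-1]
        gaps.append((srt[0] - srt[1]) / cond_entropy(factors[:, k], factors[:, partner[k]]))
    return np.array(gaps)
```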
6. Empirical Results and Comparative Analysis
Empirical results demonstrate the diagnostic and optimization advantages of MIG and its variants:
- Video Disentanglement: On DSprites and MPI3D-Real datasets, the MIPAE model achieves higher MIG values (0.8975 vs. 0.8574 on DSprites) relative to DRNET, corresponding to more effective disentanglement and superior qualitative performance (Sreekar et al., 2020).
- Language Modeling: MIG-based target selection yields large gains across arithmetic, multi-label classification, and text generation. For example, for GPT-based models on arithmetic, "Plain" (sequential) order accuracy is 74.8%, "Reverse" order is 86.3%, but "Max(MI)" rises to 94.96% (Yang et al., 31 Oct 2025).
- Dependency Correction: DMIG corrects the underestimation defect of standard MIG, as shown in music generation tasks where semantic attribute correlation is strong. MIG remains near zero despite perfect latent–factor correlation, while DMIG tracks the true level of disentanglement faithfully (Watcharasupat et al., 2021).
| Task/Domain | Standard MIG | DMIG (if applicable) | Notable Empirical Effect |
|---|---|---|---|
| Video Disentanglement | Up to 0.8975 | (not used here) | Sharper reconstructions |
| Language Modeling | ↑ Accuracy, ↓ Perplexity | – | Faster, better LLM convergence |
| Correlated Attributes | Underestimates | Accurate | DMIG correlates with SCC |
7. Applications, Impact, and Guidelines
MIG and its corrections are now a standard component of disentanglement benchmarks and are being repurposed for information-optimal training in autoregressive models. MIG provides a quantitative, comparably normalized, and factor-wise evaluation of whether and where information is concentrated in latent or output blocks.
Guidelines:
- Use standard MIG for axis-aligned, independent factors when cross-attribute MI is negligible.
- For settings with correlated factors, employ DMIG for faithful measurement (Watcharasupat et al., 2021).
- In generative decoding or training (e.g., LLMs), use MI to order targets, systematically reducing the MIG to accelerate uncertainty reduction and convergence (Yang et al., 31 Oct 2025).
- In block-structured latent models, adapt MIG to report cross-block leakage/capture (Sreekar et al., 2020).
MIG estimation requires careful empirical MI and entropy estimation, attention to regularization in continuous variables, and, where needed, correction for attribute dependencies.
MIG continues to inform both the rigorous evaluation of representation learning and principled modifications of training algorithms for large-scale models. Further research targets computational efficiency in MI estimation and hybrid approaches that combine MIG-aware ordering with dynamic uncertainty-aware sampling (Yang et al., 31 Oct 2025).