
Information Bottleneck Lagrangian

Updated 31 March 2026
  • The Information Bottleneck (IB) Lagrangian is a framework that formalizes the trade-off between compressing an input X and preserving information about a target Y using a Lagrange multiplier β.
  • It exhibits a phase transition at a critical β threshold where nontrivial representations emerge, marking the onset of IB-learnability and guiding hyperparameter selection.
  • Modern optimization techniques, including Blahut–Arimoto iterations, ADMM solvers, and operator splitting, are employed to navigate its non-convex landscape effectively.

The Information Bottleneck (IB) Lagrangian formalizes the trade-off between compressing an input random variable $X$ into a representation $Z$ and preserving information about a relevant target variable $Y$. It plays a central role in the theory and practice of representation learning, rate-distortion theory, and statistical learning frameworks for extracting minimal sufficient representations (Wu et al., 2019, Kolchinsky et al., 2017, Pan et al., 2020).

1. Formal Definition and Principle

Given a joint distribution $P(X,Y)$, the IB Lagrangian is defined via an encoder $P(Z|X)$ forming a Markov chain $Z - X - Y$ (i.e., $Z$ depends on $Y$ only through $X$). The functional is

$$\mathcal{L}_{\mathrm{IB}}[P(Z|X)] = I(X;Z) - \beta\, I(Y;Z),$$

where:

  • $I(X;Z) = \sum_{x,z} P(x)\,P(z|x) \log \frac{P(z|x)}{P(z)}$ measures the complexity (rate), i.e., the information retained about $X$.
  • $I(Y;Z) = \sum_{y,z} P(y)\,P(z|y) \log \frac{P(z|y)}{P(z)}$ quantifies the relevance, i.e., the information $Z$ carries about $Y$ after compression.
  • $\beta \geq 0$ is a Lagrange multiplier tuning the trade-off between compression (lower $I(X;Z)$) and prediction (higher $I(Y;Z)$) (Wu et al., 2019, Kolchinsky et al., 2017).
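As a concrete illustration of these definitions, the two mutual-information terms and the Lagrangian can be computed directly for a small discrete problem. A minimal sketch; the toy distribution, encoder, and function names are illustrative, not taken from the cited papers:

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats from a joint distribution matrix p_joint[a, b]."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_lagrangian(p_xy, p_z_given_x, beta):
    """L_IB = I(X;Z) - beta * I(Y;Z) for discrete P(X,Y) and encoder P(Z|X)."""
    p_x = p_xy.sum(axis=1)                 # P(x)
    p_xz = p_x[:, None] * p_z_given_x      # joint P(x, z)
    # Markov chain Z - X - Y: P(y, z) = sum_x P(x, y) P(z|x)
    p_yz = p_xy.T @ p_z_given_x            # joint P(y, z)
    return mutual_information(p_xz) - beta * mutual_information(p_yz)

# Toy example: noisy binary relation between X and Y, copy encoder Z = X.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
identity_encoder = np.eye(2)
print(ib_lagrangian(p_xy, identity_encoder, beta=2.0))
```

For the trivial encoder with $P(Z|X) = P(Z)$, both mutual informations vanish and the Lagrangian is exactly zero, matching the phase-transition discussion below.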

In the classical constrained problem,

$$\min_{P(Z|X)}\ I(X;Z) \quad \text{s.t.} \quad I(Y;Z) \geq R,$$

the Lagrangian relaxation leads directly to $\mathcal{L}_{\mathrm{IB}}$ (Kamatsuka et al., 20 Feb 2026, Rodríguez-Gálvez et al., 2019). The minimizer traces out the so-called "IB curve": the Pareto frontier in the $I(X;Z)$–$I(Y;Z)$ plane.

2. Phase Transition and IB-Learnability

The IB Lagrangian exhibits a critical phase transition: for small $\beta$, the trivial encoder $P(Z|X) = P(Z)$ (i.e., $Z$ independent of $X$) is globally optimal, yielding $I(X;Z) = 0$, $I(Y;Z) = 0$, and $\mathcal{L}_{\mathrm{IB}} = 0$. This trivial solution dominates for all $\beta \leq 1$, by the data-processing inequality $I(Y;Z) \leq I(X;Z)$. Nontrivial representations with $\mathcal{L}_{\mathrm{IB}} < 0$ and nonzero $I(Y;Z)$ become possible only for $\beta > 1$ (Wu et al., 2019).
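A quick numerical sanity check of this claim: sampling random discrete encoders at $\beta = 1$ never produces a negative Lagrangian, exactly as the data-processing inequality dictates. A hedged sketch with an assumed toy joint distribution:

```python
import numpy as np

def mi(p):
    """Mutual information (nats) of a joint distribution matrix."""
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pa @ pb)[m])).sum())

rng = np.random.default_rng(0)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)

worst = np.inf
for _ in range(2000):
    enc = rng.dirichlet(np.ones(3), size=2)                 # random P(z|x), |Z| = 3
    l_ib = mi(p_x[:, None] * enc) - 1.0 * mi(p_xy.T @ enc)  # L_IB at beta = 1
    worst = min(worst, l_ib)
print(worst)  # stays nonnegative: I(Y;Z) <= I(X;Z) for every encoder
```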

  • IB-Learnability: The dataset $(X, Y)$ is IB$_\beta$-learnable if there exists $P(Z|X)$ with $\mathcal{L}_{\mathrm{IB}}[P(Z|X)] < 0$. The critical threshold

$$\beta_0 = \inf\{\beta : (X, Y)\ \text{is IB}_\beta\text{-learnable}\}$$

marks a sharp phase transition, below which only the trivial solution exists.

  • Characterization: The onset satisfies $1/\beta_0 = \max_{Z - X - Y} I(Y;Z)/I(X;Z)$. This quantity is the hypercontractivity (contraction) coefficient, and it measures the data's maximal transmission efficiency through the bottleneck (Wu et al., 2019).

3. Sufficient Conditions and Practical Estimation

Several sufficient conditions for IB-learnability have been established:

  • Second-order variation: The trivial solution ceases to be a (local) minimum of the Lagrangian when there exists a perturbation $h(z|x)$ for which the second variation satisfies $\delta^2 \mathcal{L}_{\mathrm{IB}} < 0$ (Wu et al., 2019).
  • Functional bound: IB-learnability holds whenever $\beta > \inf_h \beta_0[h]$, where the infimum runs over score functions $h(\cdot)$ and

$$\beta_0[h] = \frac{\mathrm{Var}_X(h(X))}{\mathrm{Var}_Y\!\big(\mathbb{E}_{X|Y}[h(X)]\big)}.$$

  • Conspicuous subset: Taking $h$ to be the indicator of a subset $\Omega$ of the alphabet of $X$, one obtains a practical bound involving the size, class imbalance, and conditional confidence of $\Omega$.
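On a small alphabet, the functional bound can be evaluated exhaustively over indicator functions. A sketch (the function name and toy distribution are assumptions); for the binary symmetric example below the minimum is $25/9 \approx 2.78$, which agrees with the hypercontractivity-based threshold $1/\rho^2$ for correlation $\rho = 0.6$:

```python
import numpy as np
from itertools import combinations

def beta0_indicator_bound(p_xy):
    """Upper bound on beta_0 from beta_0[h] = Var_X(h(X)) / Var_Y(E[h(X)|Y]),
    minimized over indicator functions h = 1{X in Omega}."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_x_given_y = p_xy / p_y          # column y holds P(x | y)
    n = len(p_x)
    best = np.inf
    for k in range(1, n):             # all proper nonempty subsets Omega
        for omega in combinations(range(n), k):
            h = np.zeros(n)
            h[list(omega)] = 1.0
            mean_h = h @ p_x                          # E[h(X)]
            var_x = (h - mean_h) ** 2 @ p_x           # Var_X(h(X))
            e_h_y = h @ p_x_given_y                   # E[h(X) | Y = y]
            var_y = (e_h_y - mean_h) ** 2 @ p_y       # Var_Y(E[h(X)|Y])
            if var_y > 0:
                best = min(best, var_x / var_y)
    return best

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(beta0_indicator_bound(p_xy))
```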

A practical estimation procedure consists of training a classifier $\hat{P}(y|x)$, sorting examples by class likelihood, evaluating the learnability bound over the top-$k$ samples of each class to estimate $\beta_0$, and then choosing $\beta > \tilde{\beta}_0$ during IB optimization (Wu et al., 2019).

4. Connections to Phase Structure, Capacity, and Noise

  • Model Capacity: Insufficient model capacity raises the empirical $\beta_0$, since the learned $P(Y|X)$ is noisier, mimicking the effects of label noise or class overlap.
  • Phase Structure: At critical values of $\beta$, the cardinality/structure of the optimal encoding changes; phase transitions manifest as discontinuities on the information plane $(I(X;Z), I(Y;Z))$ (Wu et al., 2019, Huang et al., 2021).
  • Noise: For deterministic tasks ($Y$ a function of $X$), $\beta_0 = 1$, i.e., any $\beta > 1$ yields nontrivial solutions; in the presence of class-conditional noise, $\beta_0$ increases with the level of label corruption.
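The deterministic case can be checked with the same functional bound: for $Y = f(X)$, the indicator of a label's preimage attains $\beta_0[h] = 1$. A small self-contained sketch with an assumed toy setup:

```python
import numpy as np

# Deterministic task: Y = X mod 2 with X uniform over {0, 1, 2, 3}.
# The functional bound beta_0[h] = Var_X(h) / Var_Y(E[h|Y]) equals 1
# for h = 1{f(X) = 0}, consistent with beta_0 = 1 for noiseless labels.
p_x = np.full(4, 0.25)
f = np.array([0, 1, 0, 1])               # Y = f(X)
p_xy = np.zeros((4, 2))
p_xy[np.arange(4), f] = p_x              # joint P(X, Y)

h = (f == 0).astype(float)               # indicator of a label's preimage
mean_h = h @ p_x
var_x = (h - mean_h) ** 2 @ p_x          # Var_X(h(X))
p_y = p_xy.sum(axis=0)
e_h_y = (h @ p_xy) / p_y                 # E[h(X) | Y = y]
var_y = (e_h_y - mean_h) ** 2 @ p_y      # Var_Y(E[h(X)|Y])
print(var_x / var_y)  # -> 1.0
```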

5. Optimization Algorithms and Computational Aspects

Optimization of the IB Lagrangian is non-convex and sensitive to initialization, particularly near phase transitions. Various algorithms have been proposed:

  • Blahut–Arimoto-type Fixed-Point Iteration: Alternates updates of $P(Z|X)$, $P(Z)$, and $P(Y|Z)$; reliable in the strictly concave regime, but slow and sometimes unstable when the IB curve is piecewise linear.
  • Variants: Recent work includes entropy-regularized optimal transport (Sinkhorn-type) methods (Chen et al., 2023); provably convergent ADMM solvers (Huang et al., 2021); semi-relaxed closed-form alternating minimization (Chen et al., 2024); and linearly convergent operator-splitting methods (Huang et al., 2022).
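A minimal discrete implementation of the Blahut–Arimoto-type iteration described above, following the classical self-consistent IB updates; the toy distribution, $\beta$, and iteration count are illustrative, and no safeguards for instability near critical $\beta$ are included:

```python
import numpy as np

def ib_blahut_arimoto(p_xy, beta, n_z, n_iter=500, seed=0):
    """Self-consistent fixed-point iteration for min I(X;Z) - beta * I(Y;Z).

    Assumes a strictly positive discrete joint P(X,Y); returns the encoder P(z|x).
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    enc = rng.dirichlet(np.ones(n_z), size=len(p_x))    # random init of P(z|x)
    for _ in range(n_iter):
        p_z = p_x @ enc                                 # P(z)
        p_y_given_z = (p_xy.T @ enc) / p_z              # P(y|z)
        # KL(P(y|x) || P(y|z)) for every (x, z) pair
        log_ratio = np.log(p_y_given_x[:, :, None] / p_y_given_z[None, :, :])
        kl = np.einsum('xy,xyz->xz', p_y_given_x, log_ratio)
        enc = p_z[None, :] * np.exp(-beta * kl)         # IB update for P(z|x)
        enc /= enc.sum(axis=1, keepdims=True)           # normalize over z
    return enc

# Toy joint distribution: two inputs, three labels, strictly positive.
p_xy = np.array([[0.35, 0.05, 0.05],
                 [0.05, 0.35, 0.15]])
enc = ib_blahut_arimoto(p_xy, beta=5.0, n_z=2)
print(enc)
```

At $\beta = 5$, well above the learnability threshold for this distribution, the iteration escapes the trivial fixed point and converges to a near-deterministic encoder with negative Lagrangian value.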

Empirically, convergence near critical $\beta$ is challenging and local minima may abound. Theoretical convergence guarantees for modern splitting and ADMM variants address this limitation (Huang et al., 2021, Chen et al., 2024, Huang et al., 2022).

6. Extensions Beyond the Classical Lagrangian

Generalized IB objectives have been formulated by replacing $I(Y;Z)$ with decision-theoretically motivated $\mathcal{H}$-mutual information, for entropy-like functionals $\mathcal{H}$ satisfying concavity and averaging properties; alternating optimization algorithms recover both the classical and generalized settings (Kamatsuka et al., 20 Feb 2026). For deterministic $Y = f(X)$, the linear Lagrangian fails to sweep the IB curve; strictly convex penalties (e.g., the squared-IB objective, or general convex $u(I(X;T))$ functions) restore a one-to-one mapping between the trade-off hyperparameter and the achieved compression level (Kolchinsky et al., 2018, Rodríguez-Gálvez et al., 2019). In Gaussian scenarios, the IB curve admits a spectral (water-filling) characterization (Dikshtein et al., 2022).
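The failure of the linear Lagrangian on deterministic tasks, and its repair by a strictly convex penalty, can be seen in a one-dimensional toy model of the IB curve, where $I(Y;Z) = I(X;Z)$ up to $H(Y)$: the linear objective jumps between the two endpoints, while $u(r) = r^2$ selects $r^* = \beta/2$ continuously. A sketch under these simplifying assumptions, not a full IB solver:

```python
import numpy as np

# On a deterministic task the IB curve is I(Y;Z) = r = I(X;Z) up to H(Y),
# so the linear Lagrangian r - beta * r is minimized only at r = 0 (beta < 1)
# or r = H(Y) (beta > 1): no beta selects an intermediate point. A strictly
# convex penalty u(r) = r^2 yields r* = beta / 2, sweeping the curve.
H_Y = 1.0
r = np.linspace(0.0, H_Y, 1001)          # candidate values of I(X;Z)

def argmin_linear(beta):
    return r[np.argmin(r - beta * r)]    # linear IB Lagrangian on the curve

def argmin_squared(beta):
    return r[np.argmin(r**2 - beta * r)] # squared-IB (convex) objective

for beta in (0.5, 0.9, 1.1, 1.5):
    print(beta, argmin_linear(beta), argmin_squared(beta))
```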

7. Practical and Theoretical Significance

The IB Lagrangian provides a principled framework for achieving minimal sufficient representations, reconciling rate-distortion and predictive representation learning. The phase transition (learnability threshold) formalizes the onset of nontrivial feature learning and offers concrete guidance for hyperparameter selection and evaluation of model capacity (Wu et al., 2019). Modern implementation—spanning variational surrogates, mapping approaches, and neural estimators—validates and extends these insights to high-dimensional, structured, and deep-network regimes (Chen et al., 26 Jul 2025, Kolchinsky et al., 2017, Yang et al., 2024). The IB Lagrangian thus remains foundational for both theoretical analysis and the algorithmic development of minimal, robust, and task-relevant representations in information theory and machine learning.
