Info-Theoretic Lagrangian Formulation

Updated 4 July 2026

Information-Theoretic Lagrangian Formulation is a variational framework that treats information measures such as mutual information and divergence as primary objectives and constraints.
It unifies diverse approaches including the Information Bottleneck, latent-variable generative models, Hamilton–Jacobi information geometry, and stochastic-flow uncertainty quantification under a common optimization paradigm.
By converting constrained problems into unconstrained Lagrangian duals, it offers practical insights for model calibration, parameter invariance, and effective divergence control.

An information-theoretic Lagrangian formulation is a variational construction in which the central objective, the constraints, or the generated two-point function are information-theoretic quantities such as mutual information, relative entropy, a $\varphi$-divergence, or a canonical divergence on a statistical manifold. In the cited literature, this notion appears in several mathematically distinct but structurally related settings: the Information Bottleneck (IB), where $I(T;Y)$ is optimized under a compression constraint $I(X;T)\le r$; latent-variable generative modeling, where mutual information is optimized subject to consistency constraints between encoder and decoder factorizations; Hamilton–Jacobi constructions in information geometry, where a divergence function is realized as a Hamilton principal function; and uncertainty quantification for stochastic flows, where path-space $\varphi$-divergences control Lagrangian prediction error (Rodríguez-Gálvez et al., 2019, Zhao et al., 2018, Ciaglia et al., 2017, Branicki et al., 2019). A related geometric line develops reparameterisation-invariant Lagrangian formalisms on Finsler and Kawaguchi manifolds, supplying a parameter-independent variational language that can be adapted to information-geometric interpretations (Tanaka, 2013).

1. Variational structure and scope

A recurring pattern is the replacement of an explicitly constrained information-theoretic problem by an unconstrained Lagrangian. In the IB setting, the primal problem is

$F_{\textnormal{IB,max}}(r) = \max_{T \in \Delta} \big\{ I(T;Y) \big\} \quad \text{s.t.}\quad I(X;T) \le r,$

with $\Delta$ the set of representations obeying the Markov chain $Y \leftrightarrow X \leftrightarrow T$. In latent-variable generative modeling, the primal problem optimizes mutual information between latent and visible variables subject to divergences enforcing consistency between $p_\theta(x,z)=p(z)p_\theta(x\mid z)$ and $q_\theta(x,z)=q(x)q_\theta(z\mid x)$. In information geometry, the variational object is not primarily an optimization criterion but a Lagrangian $\mathfrak{L}$ on the tangent bundle $T\mathcal{M}$ whose Hamilton principal function $S$ reproduces a divergence $D$ and the tensors $(g, T)$ of a statistical manifold. In stochastic-flow uncertainty quantification, the relevant Lagrangian object is path-based: the discrepancy between Lagrangian observables is bounded by divergences between path measures induced by distinct Eulerian dynamics (Rodríguez-Gálvez et al., 2019, Zhao et al., 2018, Ciaglia et al., 2017, Branicki et al., 2019).

These formulations differ in what the multiplier enforces. In IB and in generative modeling, Lagrange multipliers penalize violations of compression or model-consistency constraints. In Hamilton–Jacobi information geometry, the Lagrangian encodes metric and skewness data so that the action itself becomes a divergence. In stochastic flows, the variational structure is supplied by the Legendre–Fenchel dual representation

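$D_\varphi(P\|Q) = \sup_{f} \big\{ \mathbb{E}^{P}[f] - \mathbb{E}^{Q}[\varphi^{*}(f)] \big\}, \qquad \varphi^{*}(y) = \sup_{x} \{ xy - \varphi(x) \},$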

which converts divergence control into sharp bounds on observable discrepancies (Branicki et al., 2019).
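As a quick numerical check of a Legendre–Fenchel dual representation of this kind, the sketch below evaluates the KL case on a small discrete space. It assumes the normalization $\varphi(x) = x\log x - x + 1$, whose convex conjugate is $\varphi^*(y) = e^y - 1$; the random distributions and all function names are hypothetical and serve only as an illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def dual_objective(f, p, q):
    """E_P[f] - E_Q[phi*(f)] with phi*(y) = exp(y) - 1 (assumed KL normalization)."""
    return float(np.sum(p * f) - np.sum(q * (np.exp(f) - 1.0)))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))   # hypothetical "true" law
q = rng.dirichlet(np.ones(6))   # hypothetical "approximate" law

kl = kl_divergence(p, q)

# The supremum is attained at f = phi'(dP/dQ) = log(dP/dQ).
f_star = np.log(p / q)
best = dual_objective(f_star, p, q)

# Arbitrary test functions give values that never exceed the divergence.
random_values = [dual_objective(rng.normal(size=6), p, q) for _ in range(1000)]
```

Any test function therefore yields a lower bound on the divergence, which is the mechanism that turns divergence control into bounds on observables.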

A plausible implication is that “information-theoretic Lagrangian formulation” is best understood as a family of constructions rather than a single canonical formalism. What unifies them is that information quantities are treated as variational primitives rather than as secondary diagnostics.

2. Information Bottleneck and convex information penalties

The classical IB problem seeks compressed representations $T$ of $X$ that retain task-relevant information about $Y$. Its standard Lagrangian is

$\mathcal{L}_{\text{IB}}(T;\beta) = I(T;Y) - \beta\, I(X;T),$

with $\beta \in [0,1]$. Standard practice is to solve $\max_{T \in \Delta} \mathcal{L}_{\text{IB}}(T;\beta)$ for many values of $\beta$, plot the resulting points $(I(X;T), I(T;Y))$ in the information plane, and select a representation near the desired compression level $r$. Algorithms used include Blahut–Arimoto, deterministic annealing, agglomerative IB, and, for high-dimensional data, neural-network based IB such as variational IB and nonlinear IB (Rodríguez-Gálvez et al., 2019).
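One way to carry out such a scan on a small discrete problem is the classical self-consistent IB iteration, a Blahut–Arimoto-type scheme. The sketch below is an illustration only, not the implementation used in the cited work. It assumes a strictly positive joint table `p_xy` (hypothetical toy data) and writes the update for the sign convention $I(T;Y) - \beta I(X;T)$, so the exponent carries a factor $1/\beta$.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats for a joint probability table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa * pb)[mask])))

def iterative_ib(p_xy, n_t, beta, n_iter=300, seed=0):
    """Self-consistent updates for maximizing I(T;Y) - beta * I(X;T), 0 < beta <= 1.

    Returns the pair (I(X;T), I(T;Y)) reached after n_iter iterations.
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    q_t_given_x = rng.dirichlet(np.ones(n_t), size=len(p_x))  # random soft encoder
    for _ in range(n_iter):
        q_t = np.clip(p_x @ q_t_given_x, 1e-12, None)
        q_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x / q_t[:, None]
        # KL(p(y|x) || q(y|t)) for every pair (x, t)
        log_ratio = np.log(p_y_given_x[:, None, :] / q_y_given_t[None, :, :])
        kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
        logits = np.log(q_t)[None, :] - kl / beta
        logits -= logits.max(axis=1, keepdims=True)   # numerical stabilization
        q_t_given_x = np.exp(logits)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    joint_xt = q_t_given_x * p_x[:, None]
    joint_ty = joint_xt.T @ p_y_given_x               # uses the chain Y - X - T
    return mutual_information(joint_xt), mutual_information(joint_ty)

rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(12)).reshape(4, 3)       # hypothetical joint p(x, y)
i_xy = mutual_information(p_xy)
scan = {beta: iterative_ib(p_xy, n_t=4, beta=beta) for beta in (0.1, 0.3, 0.6, 0.9)}
```

Each entry of `scan` is one point in the information plane. The iteration only finds local optima, so in practice several random seeds per multiplier would be compared.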

The difficulty is that there is no one-to-one mapping between $\beta$ and the attained compression $I(X;T)$. The pathology is most explicit when $Y$ is a deterministic function of $X$. Kolchinsky et al. showed that the IB curve is then piecewise linear:

$F_{\textnormal{IB,max}}(r) = r \quad \text{for } r \in [0, H(Y)],$

and

$F_{\textnormal{IB,max}}(r) = H(Y) \quad \text{for } r \ge H(Y).$

In this case, the classical “scan over $\beta$” fails to explore the trade-off: the entire increasing part is reached with the same $\beta = 1$, all points in the flat part beyond $r = H(Y)$ share $\beta = 0$, and the point $(H(Y), H(Y))$ is a maximizer for all $\beta \in (0,1)$ (Rodríguez-Gálvez et al., 2019).
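The degeneracy can be reproduced on a grid. The sketch below assumes a hypothetical value $H(Y) = 1$ nat and, as a simplification, treats the linear Lagrangian as a function of the compression level alone, $r \mapsto F_{\textnormal{IB,max}}(r) - \beta r$. The grid and function names are illustrative.

```python
import numpy as np

H_Y = 1.0                                   # hypothetical H(Y) in nats
r_grid = np.linspace(0.0, 2.0, 2001)        # candidate compression levels
f_ib = np.minimum(r_grid, H_Y)              # piecewise-linear IB curve

def linear_maximizers(beta):
    """All grid points r that maximize f_ib(r) - beta * r (up to rounding)."""
    values = f_ib - beta * r_grid
    return r_grid[np.isclose(values, values.max(), rtol=0.0, atol=1e-12)]

maximizers = {beta: linear_maximizers(beta) for beta in (0.0, 0.2, 0.5, 0.8, 1.0)}

for beta, points in maximizers.items():
    print(f"beta={beta:.1f}: {len(points)} maximizer(s), "
          f"from r={points.min():.3f} to r={points.max():.3f}")
```

Every multiplier strictly between 0 and 1 selects the single corner point, while the two endpoint multipliers each select a whole segment, so no choice of $\beta$ isolates an interior point of the increasing part.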

The convex IB formulation replaces the linear penalty by a strictly convex increasing function:

$\mathcal{L}_{\text{IB},u}(T;\beta_u) = I(T;Y) - \beta_u\, u(I(X;T)).$

If $u$ is monotonically increasing and strictly convex, then the IB curve can be completely recovered by maximizers of $\mathcal{L}_{\text{IB},u}$, and for each point on the IB curve with slope $\beta$ there is a unique $\beta_u$ reaching that point. The multiplier is strictly decreasing as a function of the compression $r$. Standard IB is recovered by $u(r) = r$, but this is a degenerate boundary case because $u$ is not strictly convex. The squared IB Lagrangian corresponds to $u(r) = r^2$, the power IB family to $u(r) = r^{1+\alpha}$ with $\alpha > 0$, and the exponential IB Lagrangian to $u(r) = \exp(\eta r)$ with $\eta > 0$ (Rodríguez-Gálvez et al., 2019).

When the IB curve is known as $I(T;Y) = f_{\text{IB}}(I(X;T))$, the convex formulation yields an explicit bijection between compression and multiplier:

$\beta_u(r) = \frac{f_{\text{IB}}'(r)}{u'(r)}.$
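The bijection can be read as the stationarity condition of $f_{\text{IB}}(r) - \beta_u u(r)$, namely $f_{\text{IB}}'(r) = \beta_u u'(r)$. The sketch below checks this on the deterministic piecewise-linear curve, again with a hypothetical $H(Y) = 1$ nat and a grid search over compression levels in place of an optimization over representations. The penalty parameters ($\alpha = 0.5$, $\eta = 3$) and the target levels are arbitrary illustrative choices.

```python
import numpy as np

H_Y = 1.0                                   # hypothetical H(Y) in nats
r_grid = np.linspace(0.0, 2.0, 20001)
f_ib = np.minimum(r_grid, H_Y)              # deterministic-case IB curve

def f_ib_slope(r):
    """Slope of the piecewise-linear curve: 1 on the increasing part, 0 beyond."""
    return 1.0 if r < H_Y else 0.0

# Each entry holds a strictly convex penalty u and its derivative u'.
PENALTIES = {
    "squared": (lambda r: r**2, lambda r: 2.0 * r),
    "power, alpha=0.5": (lambda r: r**1.5, lambda r: 1.5 * r**0.5),
    "exponential, eta=3": (lambda r: np.exp(3.0 * r), lambda r: 3.0 * np.exp(3.0 * r)),
}

def multiplier_for(target_r, du):
    """Multiplier from the stationarity condition f'(r) = beta_u * u'(r)."""
    return f_ib_slope(target_r) / du(target_r)

def attained_compression(beta_u, u):
    """Grid maximizer of f_ib(r) - beta_u * u(r)."""
    return float(r_grid[np.argmax(f_ib - beta_u * u(r_grid))])

results = {}
for name, (u, du) in PENALTIES.items():
    for target_r in (0.3, 0.7):
        beta_u = multiplier_for(target_r, du)
        results[(name, target_r)] = attained_compression(beta_u, u)
```

For each penalty, a single maximization with the computed multiplier lands on the requested compression level, which is not possible with the linear penalty on this curve.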

This is the exact “missing piece” not provided in the earlier squared-IB argument. It implies that, for known IB curve shapes, one can select a desired compression $r$, compute the corresponding $\beta_u$, and solve a single unconstrained optimization rather than scanning many multipliers. When the IB curve is unknown, the shifted exponential IB Lagrangian

$\mathcal{L}_{\text{IB,sh-exp}}(T; r^*, \eta) = I(T;Y) - \exp\big(\eta\,(I(X;T) - r^*)\big)$

is proposed as a practical approximation. For large $\eta$, many values of $\beta_u$ produce optima with $I(X;T) \approx r^*$, a phenomenon called value convergence (Rodríguez-Gálvez et al., 2019).
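The effect of a large exponent can be illustrated without knowing the curve in advance. The sketch below assumes a shifted exponential penalty of the form $\exp(\eta (r - r^*))$ and three hypothetical concave curve shapes, none of them taken from the cited paper. It maximizes each penalized objective over a grid of compression levels and records how far the optimum lies from the target $r^* = 0.5$.

```python
import numpy as np

r_grid = np.linspace(1e-4, 1.0, 10000)      # candidate compression levels
R_STAR = 0.5                                # desired compression (illustrative)

# Hypothetical concave, increasing curve shapes standing in for an unknown IB curve.
CURVES = {
    "piecewise linear": np.minimum(r_grid, 0.8),
    "saturating exponential": 1.0 - np.exp(-2.0 * r_grid),
    "square root": np.sqrt(r_grid),
}

def attained_compression(curve, eta):
    """Grid maximizer of curve(r) - exp(eta * (r - R_STAR))."""
    objective = curve - np.exp(eta * (r_grid - R_STAR))
    return float(r_grid[np.argmax(objective)])

# Worst-case distance to the target over all curve shapes, per exponent.
gaps = {
    eta: max(abs(attained_compression(curve, eta) - R_STAR) for curve in CURVES.values())
    for eta in (5.0, 50.0, 500.0)
}
```

Stationarity gives $r = r^* + \eta^{-1}\log(f'(r)/\eta)$ for a smooth curve $f$, so the offset from $r^*$ shrinks roughly like $\log(\eta)/\eta$ regardless of the slope, which is what the shrinking entries of `gaps` reflect.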

Conceptually, this is an information-theoretic Lagrangian formulation in a strict sense: the objective is $I(T;Y)$, the constraint is $I(X;T) \le r$, and the unconstrained dual objective is again written entirely in information-theoretic terms. The paper explicitly notes the parallel with rate–distortion and capacity–cost problems (Rodríguez-Gálvez et al., 2019).

3. Latent-variable generative models as Lagrangian duals

In latent-variable generative modeling, the information-theoretic Lagrangian formulation starts from a single constrained primal problem. The model has observed variables $x$ and latent variables $z$, with decoder $p_\theta(x\mid z)$, encoder $q_\theta(z\mid x)$, data distribution $q(x)$, prior $p(z)$, and two joint factorizations

$p_\theta(x,z) = p(z)\,p_\theta(x\mid z), \qquad q_\theta(x,z) = q(x)\,q_\theta(z\mid x).$

Consistency is encoded by a vector of divergences $\mathcal{D} = (\mathcal{D}_1, \dots, \mathcal{D}_m)$ such that $\mathcal{D} = 0$ if and only if $p_\theta(x,z) = q_\theta(x,z)$. A preference over consistent joints is expressed through

$f(\theta) = \alpha_1 I_{q_\theta}(x;z) + \alpha_2 I_{p_\theta}(x;z),$

where the signs of $\alpha_1, \alpha_2$ determine whether mutual information is maximized or minimized. The corresponding Lagrangian is

$\mathcal{L}(\theta, \lambda) = \alpha_1 I_{q_\theta}(x;z) + \alpha_2 I_{p_\theta}(x;z) + \lambda^\top \mathcal{D},$

or, in the relaxed problem with constraint $\mathcal{D} \le \epsilon$,

$\mathcal{L}_\epsilon(\theta, \lambda) = \alpha_1 I_{q_\theta}(x;z) + \alpha_2 I_{p_\theta}(x;z) + \lambda^\top (\mathcal{D} - \epsilon), \qquad \lambda \ge 0.$

This formulation is used to show that many objectives are Lagrangian dual functions of the same primal optimization problem (Zhao et al., 2018).

The paper recovers a large class of models by specific choices of $\alpha_1, \alpha_2$, $\lambda$, and $\mathcal{D}$. VAE corresponds to $\alpha_1 = \alpha_2 = 0$ with a single KL joint-consistency term. $\beta$-VAE yields, up to an additive constant,

$\mathcal{L}_{\beta\text{-VAE}} = (\beta - 1)\, I_{q_\theta}(x;z) + \beta\, D_{\mathrm{KL}}(q_\theta(z)\,\|\,p(z)) + \mathbb{E}_{q_\theta(z)}\big[D_{\mathrm{KL}}(q_\theta(x\mid z)\,\|\,p_\theta(x\mid z))\big],$

so $\beta > 1$ explicitly minimizes mutual information. InfoGAN corresponds to MI maximization under marginal-matching and inference-consistency constraints. AAE, InfoVAE, ALI/BiGAN, ALICE, CycleGAN, DiscoGAN, and AS-VAE are likewise represented as specific Lagrangian instantiations (Zhao et al., 2018).
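The way a $\beta$-VAE loss splits into a mutual-information term and consistency terms can be checked exactly on a discrete toy model. The sketch below assumes the usual loss (reconstruction term plus $\beta$ times the averaged KL from encoder to prior) and the decomposition $(\beta-1) I_q(x;z) + \beta\, D_{\mathrm{KL}}(q(z)\|p(z)) + \mathbb{E}_{q(z)}[D_{\mathrm{KL}}(q(x\mid z)\|p(x\mid z))]$ plus the data entropy. All tables are random and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_z, beta = 5, 3, 4.0

q_x = rng.dirichlet(np.ones(n_x))                      # data distribution q(x)
q_z_given_x = rng.dirichlet(np.ones(n_z), size=n_x)    # encoder, shape (n_x, n_z)
p_z = rng.dirichlet(np.ones(n_z))                      # prior p(z)
p_x_given_z = rng.dirichlet(np.ones(n_x), size=n_z)    # decoder, shape (n_z, n_x)

q_xz = q_x[:, None] * q_z_given_x                      # encoder joint q(x, z)
q_z = q_xz.sum(axis=0)                                 # aggregate posterior q(z)
q_x_given_z = (q_xz / q_z[None, :]).T                  # shape (n_z, n_x)

# Standard beta-VAE loss: -E_q[log p(x|z)] + beta * E_q(x)[KL(q(z|x) || p(z))]
reconstruction = -np.sum(q_xz * np.log(p_x_given_z.T))
rate = np.sum(q_xz * np.log(q_z_given_x / p_z[None, :]))
beta_vae_loss = reconstruction + beta * rate

# Lagrangian-style decomposition of the same quantity
mi_q = np.sum(q_xz * np.log(q_xz / (q_x[:, None] * q_z[None, :])))
kl_marginal = np.sum(q_z * np.log(q_z / p_z))
kl_conditional = np.sum(q_xz.T * np.log(q_x_given_z / p_x_given_z))
entropy_x = -np.sum(q_x * np.log(q_x))                 # constant in the parameters
lagrangian_form = (beta - 1.0) * mi_q + beta * kl_marginal + kl_conditional
```

The two expressions differ only by the data entropy `entropy_x`, which does not depend on the model parameters, so the coefficient $\beta - 1$ on the mutual information is exactly the weight the loss places on compressing the latent code.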

The framework also classifies objectives by computability. Likelihood-based terms are expectations of log-likelihoods and are efficiently estimated with Monte Carlo and reparametrization. Unary likelihood-free terms are divergences over a single marginal, such as $D(q_\theta(z)\,\|\,p(z))$ or $D(q(x)\,\|\,p_\theta(x))$, and require adversarial training or kernel methods. Binary likelihood-free terms are divergences between the joint distributions $q_\theta(x,z)$ and $p_\theta(x,z)$ and are empirically harder still. The closure theorem states that, for KL-based divergences and under the paper’s equivalence notion, any likelihood-based computable KL Lagrangian objective is equivalent to a linear combination of VMI and $\beta$-VAE; any unary likelihood-free computable KL Lagrangian objective is equivalent to a linear combination of InfoVAE and InfoGAN; and any binary likelihood-free computable KL Lagrangian objective is equivalent to a linear combination of ALICE, InfoVAE, and InfoGAN (Zhao et al., 2018).

A further contribution is dual optimization over both model parameters and multipliers:

$\max_{\lambda \ge 0}\; \min_{\theta}\; \alpha_1 I_{q_\theta}(x;z) + \alpha_2 I_{p_\theta}(x;z) + \lambda^\top (\mathcal{D} - \epsilon).$

The paper uses convex upper bounds when mutual information is minimized and concave lower bounds when it is maximized, proves strong duality under mild conditions, and argues that the saddle-point solution is Pareto optimal with respect to the mutual-information objective and the constraint values. This supplies an optimization-theoretic interpretation of fixed-$\lambda$ generative objectives as particular, not generally Pareto-optimal, points in a broader Lagrangian family (Zhao et al., 2018).
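The mechanics of optimizing over multipliers as well as parameters can be shown on a scalar toy problem. The sketch below is not the paper's objective: it replaces the mutual-information term by the hypothetical stand-in $f(\theta) = (\theta - 2)^2$ and the consistency constraint by $\theta^2 - 1 \le 0$, then runs projected gradient ascent on the multiplier around an exact inner minimization.

```python
def inner_minimizer(lam):
    """argmin over theta of (theta - 2)^2 + lam * (theta^2 - 1), in closed form."""
    return 2.0 / (1.0 + lam)

def dual_ascent(step=0.5, n_iter=200):
    """Alternate exact inner minimization with projected ascent on lam >= 0."""
    lam = 0.0
    for _ in range(n_iter):
        theta = inner_minimizer(lam)
        violation = theta**2 - 1.0              # constraint value; positive if violated
        lam = max(0.0, lam + step * violation)  # ascent step, projected onto lam >= 0
    return inner_minimizer(lam), lam

theta_star, lam_star = dual_ascent()
print(f"theta = {theta_star:.6f}, lambda = {lam_star:.6f}")
```

The multiplier grows while the constraint is violated and settles where the constraint is active. A fixed multiplier chosen in advance would instead return whatever point of the trade-off that particular weight happens to select.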

4. Hamilton–Jacobi theory and information geometry

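A different information-theoretic Lagrangian formulation arises in information geometry. A statistical manifold is written as $(\mathcal{M}, g, T)$, where $g$ is a Riemannian metric and $T$ is a symmetric rank-3 tensor encoding skewness and dual connections. The central proposal is to define a Lagrangian on $T\mathcal{M}$ such that its Hamilton principal function becomes a divergence, or more generally a potential function, on $\mathcal{M}\times\mathcal{M}$. The basic family is

$\mathfrak{L}_\alpha(q, v) = \frac{1}{2}\, g_{jk}(q)\, v^j v^k + \frac{\alpha}{6}\, T_{jkl}(q)\, v^j v^k v^l,$

with associated action

$I(\gamma) = \int_0^1 \mathfrak{L}_\alpha(\gamma(t), \dot\gamma(t))\, dt.$

Its Hamilton principal function $S_\alpha(q_{in}, q_{fin})$, the action evaluated on the solution joining $q_{in}$ to $q_{fin}$, satisfies derivative identities that recover the statistical tensors:

$\left.\frac{\partial^2 S_\alpha}{\partial q_{fin}^j\, \partial q_{in}^k}\right|_{q_{in}=q_{fin}} = -g_{jk}$

and

$\left.\left(\frac{\partial^3 S_\alpha}{\partial q_{in}^j\, \partial q_{in}^k\, \partial q_{fin}^l} - \frac{\partial^3 S_\alpha}{\partial q_{fin}^j\, \partial q_{fin}^k\, \partial q_{in}^l}\right)\right|_{q_{in}=q_{fin}} = 2\alpha\, T_{jkl}.$

Thus $S_\alpha$ reproduces the metric and skewness data of the statistical manifold (Ciaglia et al., 2017).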

The self-dual case $T = 0$ is especially transparent. Then the only connection is the Levi-Civita connection of $g$, the canonical divergence is, up to the conventional factor $\tfrac{1}{2}$, the square of the geodesic distance $d$,

$D(q_{in}, q_{fin}) = \frac{1}{2}\, d^2(q_{in}, q_{fin}),$

and this divergence is exactly the Hamilton principal function of the metric Lagrangian

$\mathfrak{L}_g(q, v) = \frac{1}{2}\, g_{jk}(q)\, v^j v^k.$

This solves the inverse problem completely for self-dual statistical manifolds: given the canonical divergence, one recovers a Lagrangian whose principal function equals it (Ciaglia et al., 2017).

The framework also includes explicit non-self-dual examples. For the one-dimensional exponential family

$p(x;\xi) = \xi\, e^{-\xi x}, \qquad x > 0,\ \xi > 0,$

the Kullback–Leibler divergence

$D_{\mathrm{KL}}(\xi_{in}\,\|\,\xi_{fin}) = \ln\frac{\xi_{in}}{\xi_{fin}} + \frac{\xi_{fin}}{\xi_{in}} - 1$

is shown to be the Hamilton principal function of

$\mathfrak{L}(\xi, v) = e^{v/\xi} - \frac{v}{\xi} - 1.$

The authors also note that the dynamical system associated with $\mathfrak{L}$ coincides with that of the metric Lagrangian for this model, so the two Lagrangians are alternative Lagrangians for the same dynamics (Ciaglia et al., 2017).
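A claim of this type, that a Kullback–Leibler divergence equals the action of a Lagrangian along its solution curve, can be tested numerically. The sketch below rests on assumptions of its own. It takes the family to be the exponential distribution $p(x;\xi) = \xi e^{-\xi x}$, the Lagrangian to be $e^{v/\xi} - v/\xi - 1$, and the solution joining $\xi_{in}$ to $\xi_{fin}$ over unit time to be the Fisher-metric geodesic $\xi(t) = \xi_{in}(\xi_{fin}/\xi_{in})^t$. The endpoint values and the perturbation are arbitrary.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def lagrangian(xi, v):
    """Assumed Lagrangian exp(v/xi) - v/xi - 1 on the parameter half-line."""
    u = v / xi
    return np.exp(u) - u - 1.0

def action(path, t):
    """Action of a sampled path, with velocities from finite differences."""
    velocity = np.gradient(path, t, edge_order=2)
    return trapezoid(lagrangian(path, velocity), t)

def kl_exponential(xi_a, xi_b):
    """KL(p_a || p_b) for p(x; xi) = xi * exp(-xi * x), by numerical quadrature."""
    x = np.linspace(0.0, 200.0, 400001)
    p_a = xi_a * np.exp(-xi_a * x)
    log_ratio = np.log(xi_a / xi_b) - (xi_a - xi_b) * x
    return trapezoid(p_a * log_ratio, x)

xi_in, xi_fin = 0.5, 2.0                       # illustrative endpoints
t = np.linspace(0.0, 1.0, 20001)
geodesic = xi_in * (xi_fin / xi_in) ** t       # v / xi is constant along this path
perturbed = geodesic * (1.0 + 0.1 * np.sin(np.pi * t))   # same endpoints

s_geodesic = action(geodesic, t)
s_perturbed = action(perturbed, t)
kl = kl_exponential(xi_in, xi_fin)
```

Under these assumptions `s_geodesic` matches `kl` to quadrature accuracy, while the perturbed path with the same endpoints has a strictly larger action, consistent with the geodesic being the extremal.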

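The same approach extends to quantum pure states. On the complex projective space $\mathcal{P}(\mathcal{H})$, the manifold is self-dual with metric $g_{FS}$, the Fubini–Study metric. A degenerate Lagrangian is constructed on $T\mathcal{H}_0$, with $\mathcal{H}_0 = \mathcal{H}\setminus\{0\}$, and descends to the metric Lagrangian on $T\mathcal{P}(\mathcal{H})$. Its Hamilton principal function is the square of the Fubini–Study distance. For a qubit, with normalized state vectors and up to the normalization chosen for $g_{FS}$,

$S(\psi_{in}, \psi_{fin}) = \arccos^2 \big|\langle \psi_{in} | \psi_{fin} \rangle\big|,$

which is exactly the canonical divergence on the pure-state manifold (Ciaglia et al., 2017).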

An important caveat is that the functions $S_\alpha$ are not automatically positive definite or symmetric. They are potential functions in the Amari sense, and the paper notes that higher-order velocity terms may be added to the Lagrangian to enforce positivity and other divergence properties without changing the quadratic and cubic contributions (Ciaglia et al., 2017).

5. Path-space divergences and Lagrangian uncertainty quantification

In stochastic dynamics, the expression “Lagrangian” refers to path-based prediction rather than to particle labels in a finite-dimensional latent model. The reference and approximate systems are specified by stochastic differential equations with Eulerian vector fields and diffusion matrices, but the quantities of interest are statistics of trajectories, that is, functionals $f$ on path space $\mathcal{W}$. The true and approximate predictions are

$\mathbb{E}^{P}[f] = \int_{\mathcal{W}} f\, dP, \qquad \mathbb{E}^{Q}[f] = \int_{\mathcal{W}} f\, dQ,$

where $P$ and $Q$ are the corresponding path measures. The central problem is to control the Lagrangian prediction error $|\mathbb{E}^{P}[f] - \mathbb{E}^{Q}[f]|$ in terms of divergences between these induced path laws (Branicki et al., 2019).

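The framework uses $\varphi$-divergences

$D_\varphi(P\|Q) = \int \varphi\!\left(\frac{dP}{dQ}\right) dQ$

for strictly convex $\varphi$ satisfying $\varphi(1) = 0$ and $\varphi'(1) = 0$. Important examples include KL, $\chi^2$, Hellinger, total variation, and Rényi-related $\alpha$-divergences. Their variational representation supplies the core information inequality. For $P \ll Q$ with finite divergence and $f$ in the corresponding Orlicz class,

$-\mathcal{B}_{\varphi,-}(f) \;\le\; \mathbb{E}^{P}[f] - \mathbb{E}^{Q}[f] \;\le\; \mathcal{B}_{\varphi,+}(f),$

where

$\mathcal{B}_{\varphi,\pm}(f) = \inf_{\lambda > 0} \frac{1}{\lambda} \Big\{ \mathbb{E}^{Q}\big[\varphi^{*}(\pm\lambda \tilde f)\big] + D_\varphi(P\|Q) \Big\}, \qquad \tilde f = f - \mathbb{E}^{Q}[f].$

Small divergence implies small observable error, and for smooth $\varphi$ one obtains the local expansion

$\mathcal{B}_{\varphi,\pm}(f) = \sqrt{\frac{2}{\varphi''(1)}\, \mathrm{Var}_{Q}(f)\, D_\varphi(P\|Q)} + O\big(D_\varphi(P\|Q)\big).$

For $\chi^2$ divergence, this expansion becomes exact (Branicki et al., 2019).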
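Two familiar special cases of such information inequalities can be verified directly on a finite state space. The sketch below uses the KL bound in its log-moment-generating (Donsker–Varadhan) form, $\mathbb{E}^P[f] - \mathbb{E}^Q[f] \le \inf_{c>0} c^{-1}\{\log \mathbb{E}^Q[e^{c\tilde f}] + D_{\mathrm{KL}}(P\|Q)\}$, and the $\chi^2$ bound $|\mathbb{E}^P[f] - \mathbb{E}^Q[f]| \le \sqrt{\mathrm{Var}_Q(f)\,\chi^2(P\|Q)}$, which follows from Cauchy–Schwarz. These forms may differ in presentation from the general $\varphi$ statement of the cited paper, and the distributions and observable are random, hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.dirichlet(np.ones(8))       # hypothetical approximate law Q
p = rng.dirichlet(np.ones(8))       # hypothetical true law P
f = rng.normal(size=8)              # hypothetical observable

f_centered = f - np.sum(q * f)
error = float(np.sum(p * f) - np.sum(q * f))
var_q = float(np.sum(q * f_centered**2))
kl = float(np.sum(p * np.log(p / q)))
chi2 = float(np.sum((p - q) ** 2 / q))

def gibbs_bound(sign, scales=np.logspace(-3, 2, 2000)):
    """Grid minimum over c > 0 of (log E_Q[exp(sign * c * f_centered)] + KL) / c."""
    exponents = sign * np.outer(scales, f_centered)
    log_mgf = np.log(np.exp(exponents) @ q)
    return float(np.min((log_mgf + kl) / scales))

upper = gibbs_bound(+1.0)           # bounds E_P[f] - E_Q[f] from above
lower = -gibbs_bound(-1.0)          # bounds E_P[f] - E_Q[f] from below
chi2_bound = float(np.sqrt(var_q * chi2))
```

Because the infimum is taken over a finite grid, `upper` and `lower` are valid but slightly conservative. Both bounds use only moments of the observable under the approximate law together with the divergence, which is what makes them usable when the true law is not accessible.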

A second layer of the theory connects divergences between laws to discrepancies in the underlying Eulerian fields. For time-marginal densities $I(T;Y)$ 33 and $I(T;Y)$ 34, the ratio $I(T;Y)$ 35 and the reconstructed drift

$I(T;Y)$ 36

allow one to represent the true Fokker–Planck equation as a perturbation of the approximate one. Under smoothness and ellipticity assumptions,

$I(T;Y)$ 37

For KL divergence this simplifies to

$I(T;Y)$ 38

Thus the formulation establishes an explicit chain from Eulerian model error to divergence of induced laws and then to error in Lagrangian observables (Branicki et al., 2019).

The paper also introduces finite-time $\varphi$-divergence rates (FTDR). A general bound states

$I(T;Y)$ 40

so discrepancy between the true and approximate marginals is bounded by the difference of their respective expansion rates away from the initial law. This extends to path-space projections and provides a practical model-selection and calibration principle: approximate models that reproduce the FTDR fields of the original dynamics will also control path-space divergence and hence Lagrangian prediction error (Branicki et al., 2019).

6. Parameter invariance, misconceptions, and limitations

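A distinct geometric background for information-theoretic Lagrangians is the parameter-invariant calculus developed on Finsler and Kawaguchi manifolds. A Finsler function $F(x, y)$, with $y$ the fibre (velocity) coordinate, satisfies positive homogeneity,

$F(x, \lambda y) = \lambda\, F(x, y), \qquad \lambda > 0,$

and defines the Hilbert form

$\mathcal{F} = \frac{\partial F}{\partial y^\mu}\, dx^\mu.$

The action

$\mathcal{A}[\gamma] = \int_\gamma \mathcal{F} = \int F(x(t), \dot x(t))\, dt$

is reparameterisation invariant. First-order $k$-dimensional Kawaguchi geometry generalizes this to a function $K(x, y^{\mu_1\cdots\mu_k})$ with degree-one homogeneity in the multivector variable and action $\mathcal{A}[\Sigma] = \int_\Sigma \mathcal{K}$, where $\mathcal{K}$ is the associated $k$-form. Second-order formulations impose Zermelo-type homogeneity conditions such as

$F(x, \lambda \dot x, \lambda^2 \ddot x + \rho\, \dot x) = \lambda\, F(x, \dot x, \ddot x), \qquad \lambda > 0,\ \rho \in \mathbb{R},$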

again yielding reparameterisation-invariant actions (Tanaka, 2013).
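Reparameterisation invariance of a degree-one homogeneous action is easy to observe numerically. The sketch below uses a hypothetical Randers-type function on the plane (not taken from the thesis), a fixed curve, and two parameterisations of it, and contrasts the result with the quadratic Lagrangian $\tfrac{1}{2}|\dot x|^2$, which is homogeneous of degree two and therefore not invariant.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def finsler(x, y):
    """Hypothetical Randers-type function, positively homogeneous of degree one in y."""
    return np.sqrt((1.0 + x[0] ** 2) * y[0] ** 2 + y[1] ** 2) + 0.3 * y[0]

def kinetic(x, y):
    """Quadratic Lagrangian, homogeneous of degree two in y, for contrast."""
    return 0.5 * (y[0] ** 2 + y[1] ** 2)

def curve(tau):
    """A fixed planar curve, sampled at the parameter values tau in [0, 1]."""
    return np.stack([tau, np.sin(2.0 * np.pi * tau)])

def action(lagrangian, parameter, reparam):
    """Action of the curve traversed as parameter -> curve(reparam(parameter))."""
    points = curve(reparam(parameter))
    velocity = np.gradient(points, parameter, axis=1, edge_order=2)
    return trapezoid(lagrangian(points, velocity), parameter)

def identity(s):
    return s

def warped(s):
    """Strictly increasing map of [0, 1] onto itself."""
    return (np.exp(s) - 1.0) / (np.e - 1.0)

s = np.linspace(0.0, 1.0, 20001)
finsler_gap = abs(action(finsler, s, identity) - action(finsler, s, warped))
kinetic_gap = abs(action(kinetic, s, identity) - action(kinetic, s, warped))
```

The Finsler action agrees across the two parameterisations up to discretisation error, while the kinetic action changes by an amount of order one. This is the property that lets the action be assigned to the curve itself rather than to a particular parameterisation of it.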

The thesis is not about information theory per se, but it constructs a parameter-independent Lagrangian calculus in which no fibered structure over parameter space is needed and conventional Lagrangians can be reformulated locally by parameter-independent ones. This suggests a geometric stage on which an information density could be encoded as a Finsler or Kawaguchi function, with the resulting action interpreted as a total information measure (Tanaka, 2013).

Several recurrent misconceptions are explicitly corrected by the cited works. In IB, the linear Lagrangian does not generally explore the full trade-off curve; in deterministic settings, multiple $\beta$ values can map to the same point and a single $\beta$ can map to multiple points, so “scan over $\beta$” is not a universal exploration principle (Rodríguez-Gálvez et al., 2019). In latent-variable modeling, fixed multipliers are not intrinsic characteristics of VAE-like objectives but chosen dual weights within a broader constrained optimization problem; the dual-optimization view treats them as variables rather than immutable hyperparameters (Zhao et al., 2018). In information geometry, a Hamilton principal function that reproduces $(g, T)$ is not automatically a bona fide divergence, because positivity and symmetry may require higher-order modifications of the Lagrangian (Ciaglia et al., 2017). In stochastic flows, Eulerian closeness of drifts or diffusions does not by itself imply closeness of induced path laws, especially under bifurcations or chaotic dynamics; the relevant quantities for path-based prediction are divergences between the path measures themselves (Branicki et al., 2019).

The main limitations are likewise formulation-specific. The exact mapping $r \mapsto \beta_u$ in convex IB requires the IB curve to be known and differentiable; otherwise only bounds or approximate constructions such as the shifted exponential penalty are available (Rodríguez-Gálvez et al., 2019). In generative modeling, strong duality is established only after replacing exact mutual information by convex upper bounds or concave lower bounds compatible with the sign of the objective, and the practical objectives remain dependent on the tractability class of the chosen divergences (Zhao et al., 2018). In Hamilton–Jacobi information geometry, the full global inverse problem for generic divergences is left open outside self-dual cases and special examples such as the exponential family (Ciaglia et al., 2017). In stochastic-flow uncertainty quantification, the generator-based bounds rely on smoothness, ellipticity, and absolute continuity assumptions, while the computational FTDR route trades analytic sharpness for tractability (Branicki et al., 2019).

Taken together, these works show that an information-theoretic Lagrangian formulation can mean at least four precise things: a nonlinear penalty dual for the IB rate–relevance trade-off, a dual representation of mutual-information control under model-consistency constraints, a Hamilton–Jacobi mechanism that turns action into divergence, and a path-space divergence calculus for Lagrangian predictions. The common theme is not a single formula but the systematic use of variational structure to encode information measures as objectives, constraints, or dynamical generating functions (Rodríguez-Gálvez et al., 2019, Zhao et al., 2018, Ciaglia et al., 2017, Branicki et al., 2019).