
Lagrangian Map Distillation

Updated 9 January 2026
  • Lagrangian Map Distillation is a computational framework for learning deterministic mappings between probability measures using least-action principles.
  • It employs neural and spline amortization to bypass costly ODE solvers, enabling fast, one-shot inference in optimal transport and generative modeling.
  • The framework integrates dual, c-transform, and path-energy losses to achieve scalable, geometry-aware transport with state-of-the-art empirical performance.

Lagrangian Map Distillation (LMD) refers to a set of computational frameworks for learning deterministic, one-shot mappings and associated paths between probability measures or samples, subject to least-action or Lagrangian transport costs. LMD principles unlock efficient inference in optimal transport (OT) and generative flow models, notably by amortizing path-solving and optimization steps that would traditionally require expensive ODE solvers or iterative procedures at test time. Building on advances in neural parameterizations and amortized optimization, LMD methods are applicable to a variety of transport settings, including those with complex dynamics or geometric constraints, and furnish state-of-the-art generative models with reduced inference costs (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).

1. Lagrangian Costs and Least-Action Formulations

Lagrangian costs generalize classical OT by incorporating the geometry, dynamics, and constraints of the underlying system via an action functional. For points $x, y \in \mathbb{R}^d$, the Lagrangian cost is

$$c(x, y) = \inf_{\gamma:\, \gamma(0) = x,\ \gamma(1) = y} \int_0^1 L(\gamma(t), \dot{\gamma}(t))\, dt,$$

where the Lagrangian $L(x, v)$ is typically chosen as one of:

  • Kinetic cost (“Benamou–Brenier”): $L(x, v) = \tfrac{1}{2}\|v\|^2$.
  • Kinetic plus potential (obstacles/barriers): $L(x, v) = \tfrac{1}{2}\|v\|^2 - U(x)$, with $U$ penalizing undesirable regions.
  • Position-dependent Riemannian metrics: $L(x, v) = \tfrac{1}{2} v^\top A(x)\, v$, with $A(x) \in S_{++}^d$ encoding non-Euclidean geometry.

The associated action functional $J[\gamma] = \int_0^1 L(\gamma(t), \dot{\gamma}(t))\, dt$ admits a unique minimizing path $\gamma^*$ (a geodesic of the induced geometry) under mild regularity conditions, simultaneously yielding $c(x, y)$ and the optimal path (Pooladian et al., 2024).
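To make the least-action formulation concrete, the following sketch discretizes a candidate path into knots and approximates its action by finite differences and a Riemann sum. The three Lagrangians mirror the examples above; the grid size, toy potential, and toy metric are illustrative assumptions rather than choices from the cited work.

```python
# Minimal sketch: discretized action J[gamma] ~= sum_k L(gamma_k, v_k) * dt for a
# path stored as K+1 knots on a uniform time grid. All numerical choices here
# (grid size, toy potential U, toy metric A) are illustrative assumptions.
import torch

def action(path, lagrangian):
    # path: (K+1, d) tensor of knots gamma(t_k), with t_k = k / K
    K = path.shape[0] - 1
    dt = 1.0 / K
    vel = (path[1:] - path[:-1]) / dt          # finite-difference velocities
    mid = 0.5 * (path[1:] + path[:-1])         # midpoint positions
    return (lagrangian(mid, vel) * dt).sum()

def kinetic(x, v):                             # Benamou-Brenier cost
    return 0.5 * (v ** 2).sum(-1)

def kinetic_plus_potential(x, v):              # obstacle/barrier example
    U = -10.0 * torch.exp(-((x - 0.5) ** 2).sum(-1) / 0.05)   # toy barrier near x = 0.5
    return 0.5 * (v ** 2).sum(-1) - U

def riemannian(x, v):                          # position-dependent metric A(x) = a(x) * I
    a = 1.0 + (x ** 2).sum(-1, keepdim=True)
    return 0.5 * (a * v ** 2).sum(-1)

# straight-line candidate path between x and y
x, y = torch.zeros(2), torch.ones(2)
t = torch.linspace(0, 1, 33).unsqueeze(-1)
line = (1 - t) * x + t * y
for L in (kinetic, kinetic_plus_potential, riemannian):
    print(L.__name__, action(line, L).item())  # kinetic action = 0.5 * ||y - x||^2 = 1.0
```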

2. LMD for Optimal Transport: Neural and Spline Amortization

The LMD recipe for OT underlies recent “neural optimal transport with Lagrangian costs” approaches. The goal is to distill in a single training stage:

  • A Kantorovich potential $g_\theta(y)$, parameterized by an MLP, for dual optimization.
  • An amortized map predictor $y_\zeta(x)$, also an MLP, to approximate $T(x) = \operatorname{argmin}_y \{c(x, y) - g_\theta(y)\}$.
  • A spline path predictor $\varphi_\eta(x, y)$, amortizing the coefficients of cubic-spline representations of geodesic paths.

Training alternates three batch Monte Carlo losses:

  1. Dual objective ($\ell_\mathrm{dual}(\theta)$):

$$\ell_\mathrm{dual} = \mathbb{E}_{x\sim\mu}\bigl[g_\theta^c(x)\bigr] + \mathbb{E}_{y\sim\nu}\bigl[g_\theta(y)\bigr],$$

where $g_\theta^c(x) = c(x, \hat{y}) - g_\theta(\hat{y})$, with $\hat{y}$ obtained by warm-started L-BFGS initialized at $y_\zeta(x)$.

  2. $c$-transform amortization ($\ell_{c\text{-amor}}(\zeta)$):

$$\ell_{c\text{-amor}} = \mathbb{E}_{x\sim\mu}\bigl[\|\hat{y}(x) - y_\zeta(x)\|^2\bigr].$$

  3. Path-energy amortization ($\ell_\mathrm{path}(\eta)$):

$$\ell_\mathrm{path} = \mathbb{E}_{x\sim\mu}\Bigl[E\bigl(\gamma_{\varphi_\eta(x, \hat{y}(x))};\, x, \hat{y}(x)\bigr)\Bigr],$$

with $E(\gamma; x, y) = \int_0^1 L(\gamma(t), \dot{\gamma}(t))\, dt$ and $\gamma_\varphi$ given by the amortized spline.

Optimization proceeds by stochastic, highly vectorized updates (typically via Adam) of $(\theta, \zeta, \eta)$. L-BFGS is used only during training to refine $\hat{y}$ for the dual objective; its cost is amortized away by $y_\zeta$ at inference (Pooladian et al., 2024).
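The sketch below illustrates how the three losses can be alternated in practice. The samplers, network sizes, L-BFGS settings, and the piecewise-linear path used as a stand-in for the cubic-spline parameterization are assumptions made for exposition, not the reference implementation.

```python
# Illustrative sketch of the alternating LMD losses for OT with a Lagrangian cost.
# Networks, samplers, and the piecewise-linear path (a stand-in for cubic splines)
# are assumptions for exposition.
import torch
import torch.nn as nn

d, K = 2, 8  # dimension, number of interior path knots (illustrative)

def mlp(din, dout, width=64, depth=4):
    layers, h = [], din
    for _ in range(depth):
        layers += [nn.Linear(h, width), nn.LeakyReLU()]
        h = width
    return nn.Sequential(*layers, nn.Linear(h, dout))

g_theta = mlp(d, 1)            # Kantorovich potential g_theta(y)
y_zeta  = mlp(d, d)            # amortized map predictor y_zeta(x) ~ T(x)
phi_eta = mlp(2 * d, K * d)    # amortized path predictor (interior knots)

def cost(x, y):
    # kinetic (squared-Euclidean) cost as a placeholder; a general Lagrangian
    # cost would evaluate the energy of the amortized path instead
    return 0.5 * ((x - y) ** 2).sum(-1)

def refine_y(x, y_init, iters=10):
    # warm-started L-BFGS: y_hat(x) ~ argmin_y { c(x, y) - g_theta(y) }
    y = y_init.detach().clone().requires_grad_(True)
    opt = torch.optim.LBFGS([y], max_iter=iters, line_search_fn="strong_wolfe")
    def closure():
        opt.zero_grad()
        obj = (cost(x, y) - g_theta(y).squeeze(-1)).sum()
        obj.backward()
        return obj
    opt.step(closure)
    return y.detach()

def path_energy(x, y, knots):
    # discretized kinetic energy of the path x -> knots -> y (K+1 segments)
    pts = torch.cat([x[:, None], knots, y[:, None]], dim=1)   # (B, K+2, d)
    vel = (pts[:, 1:] - pts[:, :-1]) * (K + 1)                # finite-difference velocities
    return (0.5 * (vel ** 2).sum(-1)).mean(-1)                # mean over segments ~ integral of L dt

opt_theta = torch.optim.Adam(g_theta.parameters(), lr=1e-3)
opt_zeta  = torch.optim.Adam(y_zeta.parameters(),  lr=1e-3)
opt_eta   = torch.optim.Adam(phi_eta.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(256, d)              # x ~ mu  (placeholder source samples)
    y = torch.randn(256, d) + 2.0        # y ~ nu  (placeholder target samples)
    y_hat = refine_y(x, y_zeta(x))       # refined c-transform argmin

    # 1) dual objective (maximized, so its negative is minimized w.r.t. theta)
    loss_dual = -((cost(x, y_hat) - g_theta(y_hat).squeeze(-1)).mean() + g_theta(y).mean())
    opt_theta.zero_grad(); loss_dual.backward(); opt_theta.step()

    # 2) c-transform amortization: regress y_zeta(x) onto the refined targets
    loss_amor = ((y_hat - y_zeta(x)) ** 2).sum(-1).mean()
    opt_zeta.zero_grad(); loss_amor.backward(); opt_zeta.step()

    # 3) path-energy amortization: minimize the action of the predicted path
    knots = phi_eta(torch.cat([x, y_hat], dim=-1)).view(-1, K, d)
    loss_path = path_energy(x, y_hat, knots).mean()
    opt_eta.zero_grad(); loss_path.backward(); opt_eta.step()
```

Once trained, $y_\zeta$ and $\varphi_\eta$ alone provide the map and the path, so neither L-BFGS nor the inner argmin is needed at inference.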

3. LMD in Continuous-Time Flow Map Distillation

In generative modeling, LMD appears within the “Align Your Flow” (AYF) framework as a variant of map distillation from score-based or flow-matching diffusion models. Continuous-time flow maps are neural networks $f_\theta(x_t, t, s)$ that predict $x_s$ conditioned on $x_t$, $t$, and $s$, such that $f_\theta(x_t, t, t) = x_t$ and $f_\theta(x_t, t, 0) \approx x_0$. The parametrization is

$$f_\theta(x_t, t, s) = x_t + (s - t)\, F_\theta(x_t, t, s),$$

where $F_\theta$ approximates the score or velocity field (Sabour et al., 17 Jun 2025).
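A minimal sketch of this parametrization follows; the MLP backbone and the plain concatenation of $(x_t, t, s)$ are stand-ins for the UNet and learned time embeddings used in practice.

```python
# Sketch of the flow-map parametrization f_theta(x_t, t, s) = x_t + (s - t) * F_theta(x_t, t, s).
# The MLP backbone and simple time conditioning are stand-ins for the UNet and
# time embeddings used in practice.
import torch
import torch.nn as nn

class FlowMap(nn.Module):
    def __init__(self, d, width=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(d + 2, width), nn.SiLU(),
            nn.Linear(width, width), nn.SiLU(),
            nn.Linear(width, d),
        )

    def forward(self, x_t, t, s):
        # F_theta conditioned on (x_t, t, s); the residual term vanishes at s = t
        h = torch.cat([x_t, t[:, None], s[:, None]], dim=-1)
        return x_t + (s - t)[:, None] * self.backbone(h)

f = FlowMap(d=2)
x_t = torch.randn(4, 2)
t = torch.rand(4)
assert torch.allclose(f(x_t, t, t), x_t)   # identity at s = t
```

The boundary condition $f_\theta(x_t, t, t) = x_t$ holds by construction, since the residual term is multiplied by $(s - t)$.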

The LMD loss in AYF is

$$\mathcal{L}^{(\epsilon)}_{\mathrm{LMD}}(\theta) = \mathbb{E}_{x_t,\, t,\, s} \Bigl[ w(t, s)\, \bigl\| f_\theta(x_t, t, s) - \mathrm{ODE}_{s' \to s}\bigl(f_{\theta^-}(x_t, t, s')\bigr) \bigr\|^2 \Bigr],$$

with $s' = s + \epsilon\,(t - s)$, where $f_{\theta^-}$ denotes a frozen (stop-gradient) copy of the flow map and $\mathrm{ODE}_{s' \to s}$ propagates its output from $s'$ to $s$ along the teacher flow. In the $\epsilon \to 0$ limit, this enforces end-point consistency with the velocity field, serving as a Lagrangian constraint.
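A sketch of this loss under simplifying assumptions is shown below: a single Euler step stands in for $\mathrm{ODE}_{s' \to s}$, the weight is fixed to $w(t, s) = 1$, and the target is placed under a stop-gradient. None of these choices are claimed to match the reference implementation.

```python
# Sketch of the epsilon-discretized LMD loss. The single Euler step standing in
# for ODE_{s'->s}, the unit weighting w(t, s) = 1, and the stop-gradient target
# are simplifying assumptions for exposition.
import torch

def lmd_loss(f_student, f_frozen, v_teacher, x_t, t, s, eps=1e-2):
    s_prime = s + eps * (t - s)                       # intermediate time s' = s + eps * (t - s)
    with torch.no_grad():                             # target carries no gradient
        y = f_frozen(x_t, t, s_prime)                 # frozen-copy prediction at s'
        # one Euler step of the teacher probability-flow ODE from s' to s
        target = y + (s - s_prime)[:, None] * v_teacher(y, s_prime)
    pred = f_student(x_t, t, s)
    return ((pred - target) ** 2).sum(-1).mean()      # w(t, s) = 1 assumed
```

Here `v_teacher(x, s)` denotes the (possibly autoguided) teacher velocity; a multi-step or higher-order integrator could replace the single Euler step.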

The training loop involves the following steps (a minimal sketch follows the list):

  • Sampling $x_0$ from data and $x_1$ from noise, and forming $x_t = (1 - t)\, x_0 + t\, x_1$ for $t, s \in [0, 1]$.
  • Constructing autoguided teacher velocities $v_{\mathrm{guided}}(x, t)$ as guided combinations of the main and weak teachers.
  • Computing the student prediction $\hat{x} = f_\theta(x_t, t, s)$ and the distillation target $\mathrm{ODE}_{s' \to s}(f_{\theta^-}(x_t, t, s'))$ as above.
  • Applying consistency and linearity regularization during training.
  • Using AdamW optimizers with large-batch GPU vectorization (Sabour et al., 17 Jun 2025).
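The sketch below illustrates the batch construction and an autoguided teacher velocity. The guidance form $v_{\mathrm{weak}} + \lambda\,(v_{\mathrm{main}} - v_{\mathrm{weak}})$ with $\lambda \sim \mathrm{Unif}[1, 3]$, and the toy teacher networks, are assumptions made for exposition.

```python
# Batch construction and autoguided teacher velocity for one training step.
# The guidance form v_weak + lam * (v_main - v_weak) with lam ~ Unif[1, 3] and
# the toy teacher networks are assumptions, not the cited implementation.
import torch
import torch.nn as nn

d = 2
v_main = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))  # strong teacher (stub)
v_weak = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))  # weak teacher (stub)

def v_guided(x, t, lam):
    h = torch.cat([x, t[:, None]], dim=-1)
    return v_weak(h) + lam[:, None] * (v_main(h) - v_weak(h))

x0 = torch.randn(256, d)                    # data batch (placeholder)
x1 = torch.randn(256, d)                    # noise batch
t, s = torch.rand(256), torch.rand(256)     # time pairs in [0, 1]
lam = 1.0 + 2.0 * torch.rand(256)           # lambda ~ Unif[1, 3]
x_t = (1 - t)[:, None] * x0 + t[:, None] * x1

# The epsilon-discretized LMD loss from the previous sketch is then evaluated on
# (x_t, t, s) with v_guided as the teacher velocity, and the student parameters
# are updated with AdamW.
```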

4. Computational Advantages and Inference Procedures

LMD achieves fast, single-forward-pass inference of deterministic transport maps or generative flows without requiring inner optimization or ODE solves at test time:

  • For OT (static map): $T(x) \approx y_\zeta^*(x)$ via an MLP forward pass, with paths given by $\varphi_\eta^*(x, y_\zeta^*(x))$.
  • For flow distillation: $f_\theta(x_t, t, s)$ yields $x_s$ in one evaluation for arbitrary $t, s$.

Amortized neural predictors eliminate expensive iterative procedures at inference. In OT, spline basis evaluation yields fast geodesic reconstruction. In flow models, batch inference is supported by compact UNet/MLP architectures (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).
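As an illustration, the sketch below performs few-step generation by evaluating a trained flow map on a coarse time grid; this deterministic multi-step schedule is a plausible sampler under the stated flow-map semantics, not necessarily the exact procedure of the cited works.

```python
# Few-step sampling with a trained flow map: partition [1, 0] into a few
# sub-intervals and apply f_theta once per sub-interval. This deterministic
# schedule is a plausible sampler, not necessarily the cited procedure.
import torch

@torch.no_grad()
def sample(f_theta, shape, nfe=4):
    x = torch.randn(shape)                         # x_1 ~ noise at t = 1
    times = torch.linspace(1.0, 0.0, nfe + 1)      # 1 = t_0 > t_1 > ... > t_nfe = 0
    for t, s in zip(times[:-1], times[1:]):
        tb = torch.full((shape[0],), float(t))     # broadcast times over the batch
        sb = torch.full((shape[0],), float(s))
        x = f_theta(x, tb, sb)                     # one flow-map jump from t to s
    return x

# usage with a FlowMap-style model as sketched earlier (hypothetical):
# samples = sample(f, (64, 2), nfe=2)              # two network evaluations
```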

The following table summarizes key computational elements:

| Setting | Main predictors | Runtime inference |
| --- | --- | --- |
| OT (Pooladian et al., 2024) | $y_\zeta(x)$, $\varphi_\eta(x, y)$ | One MLP pass + spline evaluation |
| AYF (Sabour et al., 17 Jun 2025) | $f_\theta(x_t, t, s)$ | One UNet/MLP pass |

This approach is especially advantageous in high-throughput or large-scale applications, as demonstrated in low-dimensional OT examples and in state-of-the-art generative modeling on ImageNet (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).

5. Architectural, Regularization, and Training Considerations

The neural components are designed for efficiency:

  • MLP parameterizations for $g_\theta$ and $y_\zeta$ use four layers of 64 leaky-ReLU units; spline coefficients are predicted by a $2 \times 1024$ MLP.
  • Backpropagation through the spline evaluation is employed for updating $\eta$.
  • The three losses are alternated for joint optimization: maximizing the dual, regressing to amortized $c$-transforms, and minimizing path energies.
  • Warm-started L-BFGS (with backtracking line search) is used during training, not inference, to stabilize dual optimization.

In AYF, tangent warm-up, autoguidance (convex combinations of strong/weak teachers), and time embedding stabilization are critical for stable training and state-of-the-art sample quality. Adversarial fine-tuning with a StyleGAN2-based discriminator further improves sample FID without significantly harming recall (Sabour et al., 17 Jun 2025).

6. Theoretical Guarantees and Empirical Performance

Theoretical analysis in the generative context establishes that flow maps trained through LMD remain stable as the number of function evaluations (NFE) varies, in contrast to standard consistency models, whose error compounds as NFE increases. In OT, the LMD approach guarantees recovery of action-minimizing maps under standard convexity and smoothness assumptions on the Lagrangian.

Empirical benchmarks show:

  • OT with general Lagrangians: Efficient geodesic computation and map extraction even for non-Euclidean or constrained systems, outperforming classical approaches in scalability and flexibility (Pooladian et al., 2024).
  • ImageNet 64×64, 512×512 (AYF): 2–4 step FID scores set new state-of-the-art for few-step image generation. For example, AYF (no GAN) achieves FID 1.25 at 2 steps, 1.15 at 4 steps; adversarial tuning further reduces FID, with only a minor decrease in recall (Sabour et al., 17 Jun 2025).
  • Text-to-image: an AYF variant was preferred by ≈60% of participants over LoRA-CM baselines in a 47-person user study.

A practical implication is that LMD-based methods provide fast, scalable, geometry- or constraint-aware mappings and generations, efficiently amortizing complex computations while maintaining generation quality even with very few neural evaluations.

7. Practical Guidelines and Recommendations

For OT tasks with general Lagrangian costs:

  • Use cubic spline path parameterizations and MLP amortization for both maps and paths.
  • Train with batched dual, c-transform, and path energy losses, alternating updates.

For generative flows:

  • AYF recommends Eulerian Map Distillation (EMD) losses over LMD losses for stability.
  • Adopt teacher autoguidance with $\lambda \sim \mathrm{Unif}[1, 3]$.
  • Use the $f_\theta(x, t, s) = x + (s - t)\, F_\theta(x, t, s)$ parameterization, with time embeddings via an MLP.
  • Apply tangent warm-up (linearity ramping) and tangent normalization.
  • Adversarial fine-tuning is optional but effective for maximizing sample quality at very low NFE.

These procedures are supported by publicly released implementations for reproducibility (Pooladian et al., 2024). The frameworks are equally applicable to low-dimensional OT (with general geometric constraints) as well as high-dimensional generative modeling benchmarks (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).
