
Lagrangian Map Distillation

Updated 9 January 2026
  • Lagrangian Map Distillation is a computational framework for learning deterministic mappings between probability measures using least-action principles.
  • It employs neural and spline amortization to bypass costly ODE solvers, enabling fast, one-shot inference in optimal transport and generative modeling.
  • The framework integrates dual, c-transform, and path-energy losses to achieve scalable, geometry-aware transport with state-of-the-art empirical performance.

Lagrangian Map Distillation (LMD) refers to a set of computational frameworks for learning deterministic, one-shot mappings and associated paths between probability measures or samples, subject to least-action or Lagrangian transport costs. LMD principles unlock efficient inference in optimal transport (OT) and generative flow models, notably by amortizing path-solving and optimization steps that would traditionally require expensive ODE solvers or iterative procedures at test time. Building on advances in neural parameterizations and amortized optimization, LMD methods are applicable to a variety of transport settings, including those with complex dynamics or geometric constraints, and furnish state-of-the-art generative models with reduced inference costs (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).

1. Lagrangian Costs and Least-Action Formulations

Lagrangian costs generalize classical OT by incorporating the geometry, dynamics, and constraints of the underlying system via an action functional. For points $x, y \in \mathbb{R}^d$, the Lagrangian cost is

$$c(x, y) = \inf_{\gamma:\, \gamma(0) = x,\ \gamma(1) = y} \int_0^1 L(\gamma(t), \dot{\gamma}(t))\, dt,$$

where the Lagrangian $L(x, v)$ is typically chosen as one of:

  • Kinetic cost (“Benamou–Brenier”): $L(x, v) = \tfrac{1}{2}\|v\|^2$.
  • Kinetic plus potential (obstacles/barriers): $L(x, v) = \tfrac{1}{2}\|v\|^2 - U(x)$, with $U$ penalizing undesirable regions.
  • Position-dependent Riemannian metrics: $L(x, v) = \tfrac{1}{2} v^\top A(x)\, v$, with $A(x) \in S_{++}^d$ encoding non-Euclidean geometry.

The associated action functional $J[\gamma] = \int_0^1 L(\gamma(t), \dot{\gamma}(t))\, dt$ admits a unique minimizing path $\gamma^*$ (a geodesic of the induced geometry) under mild regularity conditions, simultaneously yielding $c(x, y)$ and the optimal path (Pooladian et al., 2024).
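To make the least-action formulation concrete, the following sketch discretizes a candidate path into knots and approximates its action by finite differences and a Riemann sum. The three Lagrangians mirror the examples above; the grid size, toy potential, and toy metric are illustrative assumptions rather than choices from the cited work.

```python
# Minimal sketch: discretized action J[gamma] ~= sum_k L(gamma_k, v_k) * dt for a
# path stored as K+1 knots on a uniform time grid. All numerical choices here
# (grid size, toy potential U, toy metric A) are illustrative assumptions.
import torch

def action(path, lagrangian):
    # path: (K+1, d) tensor of knots gamma(t_k), with t_k = k / K
    K = path.shape[0] - 1
    dt = 1.0 / K
    vel = (path[1:] - path[:-1]) / dt          # finite-difference velocities
    mid = 0.5 * (path[1:] + path[:-1])         # midpoint positions
    return (lagrangian(mid, vel) * dt).sum()

def kinetic(x, v):                             # Benamou-Brenier cost
    return 0.5 * (v ** 2).sum(-1)

def kinetic_plus_potential(x, v):              # obstacle/barrier example
    U = -10.0 * torch.exp(-((x - 0.5) ** 2).sum(-1) / 0.05)   # toy barrier near x = 0.5
    return 0.5 * (v ** 2).sum(-1) - U

def riemannian(x, v):                          # position-dependent metric A(x) = a(x) * I
    a = 1.0 + (x ** 2).sum(-1, keepdim=True)
    return 0.5 * (a * v ** 2).sum(-1)

# straight-line candidate path between x and y
x, y = torch.zeros(2), torch.ones(2)
t = torch.linspace(0, 1, 33).unsqueeze(-1)
line = (1 - t) * x + t * y
for L in (kinetic, kinetic_plus_potential, riemannian):
    print(L.__name__, action(line, L).item())  # kinetic action = 0.5 * ||y - x||^2 = 1.0
```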

2. LMD for Optimal Transport: Neural and Spline Amortization

The LMD recipe for OT underlies recent “neural optimal transport with Lagrangian costs” approaches. The goal is to distill in a single training stage:

  • A Kantorovich potential $g_\theta(y)$, parameterized by an MLP, for dual optimization.
  • An amortized map predictor $y_\zeta(x)$, also an MLP, to approximate $T(x) = \operatorname{argmin}_y \{c(x, y) - g_\theta(y)\}$.
  • A spline path predictor $\varphi_\eta(x, y)$, amortizing the coefficients of cubic-spline representations of geodesic paths.

Training alternates three batch Monte Carlo losses:

  1. Dual objective ($\ell_\mathrm{dual}(\theta)$):

$$\ell_\mathrm{dual} = \mathbb{E}_{x\sim\mu}\bigl[g_\theta^c(x)\bigr] + \mathbb{E}_{y\sim\nu}\bigl[g_\theta(y)\bigr],$$

where $g_\theta^c(x) = c(x, \hat{y}) - g_\theta(\hat{y})$, with $\hat{y}$ obtained by warm-started L-BFGS initialized at $y_\zeta(x)$.

  2. $c$-transform amortization ($\ell_{c\text{-amor}}(\zeta)$):

$$\ell_{c\text{-amor}} = \mathbb{E}_{x\sim\mu}\bigl[\|\hat{y}(x) - y_\zeta(x)\|^2\bigr].$$

  3. Path-energy amortization ($\ell_\mathrm{path}(\eta)$):

$$\ell_\mathrm{path} = \mathbb{E}_{x\sim\mu}\Bigl[E\bigl(\gamma_{\varphi_\eta(x, \hat{y}(x))};\, x, \hat{y}(x)\bigr)\Bigr],$$

with $E(\gamma; x, y) = \int_0^1 L(\gamma(t), \dot{\gamma}(t))\, dt$ and $\gamma_\varphi$ given by the amortized spline.

Optimization proceeds by stochastic, highly vectorized updates (typically via Adam) of $(\theta, \zeta, \eta)$. L-BFGS is used only during training to refine $\hat{y}$ for the dual objective; its cost is amortized away by $y_\zeta$ at inference (Pooladian et al., 2024).
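The sketch below illustrates how the three losses can be alternated in practice. The samplers, network sizes, L-BFGS settings, and the piecewise-linear path used as a stand-in for the cubic-spline parameterization are assumptions made for exposition, not the reference implementation.

```python
# Illustrative sketch of the alternating LMD losses for OT with a Lagrangian cost.
# Networks, samplers, and the piecewise-linear path (a stand-in for cubic splines)
# are assumptions for exposition.
import torch
import torch.nn as nn

d, K = 2, 8  # dimension, number of interior path knots (illustrative)

def mlp(din, dout, width=64, depth=4):
    layers, h = [], din
    for _ in range(depth):
        layers += [nn.Linear(h, width), nn.LeakyReLU()]
        h = width
    return nn.Sequential(*layers, nn.Linear(h, dout))

g_theta = mlp(d, 1)            # Kantorovich potential g_theta(y)
y_zeta  = mlp(d, d)            # amortized map predictor y_zeta(x) ~ T(x)
phi_eta = mlp(2 * d, K * d)    # amortized path predictor (interior knots)

def cost(x, y):
    # kinetic (squared-Euclidean) cost as a placeholder; a general Lagrangian
    # cost would evaluate the energy of the amortized path instead
    return 0.5 * ((x - y) ** 2).sum(-1)

def refine_y(x, y_init, iters=10):
    # warm-started L-BFGS: y_hat(x) ~ argmin_y { c(x, y) - g_theta(y) }
    y = y_init.detach().clone().requires_grad_(True)
    opt = torch.optim.LBFGS([y], max_iter=iters, line_search_fn="strong_wolfe")
    def closure():
        opt.zero_grad()
        obj = (cost(x, y) - g_theta(y).squeeze(-1)).sum()
        obj.backward()
        return obj
    opt.step(closure)
    return y.detach()

def path_energy(x, y, knots):
    # discretized kinetic energy of the path x -> knots -> y (K+1 segments)
    pts = torch.cat([x[:, None], knots, y[:, None]], dim=1)   # (B, K+2, d)
    vel = (pts[:, 1:] - pts[:, :-1]) * (K + 1)                # finite-difference velocities
    return (0.5 * (vel ** 2).sum(-1)).mean(-1)                # mean over segments ~ integral of L dt

opt_theta = torch.optim.Adam(g_theta.parameters(), lr=1e-3)
opt_zeta  = torch.optim.Adam(y_zeta.parameters(),  lr=1e-3)
opt_eta   = torch.optim.Adam(phi_eta.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(256, d)              # x ~ mu  (placeholder source samples)
    y = torch.randn(256, d) + 2.0        # y ~ nu  (placeholder target samples)
    y_hat = refine_y(x, y_zeta(x))       # refined c-transform argmin

    # 1) dual objective (maximized, so its negative is minimized w.r.t. theta)
    loss_dual = -((cost(x, y_hat) - g_theta(y_hat).squeeze(-1)).mean() + g_theta(y).mean())
    opt_theta.zero_grad(); loss_dual.backward(); opt_theta.step()

    # 2) c-transform amortization: regress y_zeta(x) onto the refined targets
    loss_amor = ((y_hat - y_zeta(x)) ** 2).sum(-1).mean()
    opt_zeta.zero_grad(); loss_amor.backward(); opt_zeta.step()

    # 3) path-energy amortization: minimize the action of the predicted path
    knots = phi_eta(torch.cat([x, y_hat], dim=-1)).view(-1, K, d)
    loss_path = path_energy(x, y_hat, knots).mean()
    opt_eta.zero_grad(); loss_path.backward(); opt_eta.step()
```

Once trained, $y_\zeta$ and $\varphi_\eta$ alone provide the map and the path, so neither L-BFGS nor the inner argmin is needed at inference.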

3. LMD in Continuous-Time Flow Map Distillation

In generative modeling, LMD appears within the “Align Your Flow” (AYF) framework as a variant of map distillation from score-based or flow-matching diffusion models. Continuous-time flow maps are neural networks $f_\theta(x_t, t, s)$ that predict $x_s$ conditioned on $x_t$, $t$, and $s$, such that $f_\theta(x_t, t, t) = x_t$ and $f_\theta(x_t, t, 0) \approx x_0$. The parametrization is

$$f_\theta(x_t, t, s) = x_t + (s - t)\, F_\theta(x_t, t, s),$$

where $F_\theta$ approximates the score or velocity field (Sabour et al., 17 Jun 2025).
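A minimal sketch of this parametrization follows; the MLP backbone and the plain concatenation of $(x_t, t, s)$ are stand-ins for the UNet and learned time embeddings used in practice.

```python
# Sketch of the flow-map parametrization f_theta(x_t, t, s) = x_t + (s - t) * F_theta(x_t, t, s).
# The MLP backbone and simple time conditioning are stand-ins for the UNet and
# time embeddings used in practice.
import torch
import torch.nn as nn

class FlowMap(nn.Module):
    def __init__(self, d, width=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(d + 2, width), nn.SiLU(),
            nn.Linear(width, width), nn.SiLU(),
            nn.Linear(width, d),
        )

    def forward(self, x_t, t, s):
        # F_theta conditioned on (x_t, t, s); the residual term vanishes at s = t
        h = torch.cat([x_t, t[:, None], s[:, None]], dim=-1)
        return x_t + (s - t)[:, None] * self.backbone(h)

f = FlowMap(d=2)
x_t = torch.randn(4, 2)
t = torch.rand(4)
assert torch.allclose(f(x_t, t, t), x_t)   # identity at s = t
```

The boundary condition $f_\theta(x_t, t, t) = x_t$ holds by construction, since the residual term is multiplied by $(s - t)$.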

The LMD loss in AYF is

$$\mathcal{L}^{(\epsilon)}_{\mathrm{LMD}}(\theta) = \mathbb{E}_{x_t,\, t,\, s} \Bigl[ w(t, s)\, \bigl\| f_\theta(x_t, t, s) - \mathrm{ODE}_{s' \to s}\bigl(f_{\theta^-}(x_t, t, s')\bigr) \bigr\|^2 \Bigr],$$

with $s' = s + \epsilon\,(t - s)$, where $f_{\theta^-}$ denotes a frozen (stop-gradient) copy of the flow map and $\mathrm{ODE}_{s' \to s}$ propagates its output from $s'$ to $s$ along the teacher flow. In the $\epsilon \to 0$ limit, this enforces end-point consistency with the velocity field, serving as a Lagrangian constraint.
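A sketch of this loss under simplifying assumptions is shown below: a single Euler step stands in for $\mathrm{ODE}_{s' \to s}$, the weight is fixed to $w(t, s) = 1$, and the target is placed under a stop-gradient. None of these choices are claimed to match the reference implementation.

```python
# Sketch of the epsilon-discretized LMD loss. The single Euler step standing in
# for ODE_{s'->s}, the unit weighting w(t, s) = 1, and the stop-gradient target
# are simplifying assumptions for exposition.
import torch

def lmd_loss(f_student, f_frozen, v_teacher, x_t, t, s, eps=1e-2):
    s_prime = s + eps * (t - s)                       # intermediate time s' = s + eps * (t - s)
    with torch.no_grad():                             # target carries no gradient
        y = f_frozen(x_t, t, s_prime)                 # frozen-copy prediction at s'
        # one Euler step of the teacher probability-flow ODE from s' to s
        target = y + (s - s_prime)[:, None] * v_teacher(y, s_prime)
    pred = f_student(x_t, t, s)
    return ((pred - target) ** 2).sum(-1).mean()      # w(t, s) = 1 assumed
```

Here `v_teacher(x, s)` denotes the (possibly autoguided) teacher velocity; a multi-step or higher-order integrator could replace the single Euler step.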

The training loop involves the following steps (a minimal sketch follows the list):

  • Sampling $x_0$ from data and $x_1$ from noise, and forming $x_t = (1 - t)\, x_0 + t\, x_1$ for $t, s \in [0, 1]$.
  • Constructing autoguided teacher velocities $v_{\mathrm{guided}}(x, t)$ as guided combinations of the main and weak teachers.
  • Computing the student prediction $\hat{x} = f_\theta(x_t, t, s)$ and the distillation target $\mathrm{ODE}_{s' \to s}(f_{\theta^-}(x_t, t, s'))$ as above.
  • Applying consistency and linearity regularization during training.
  • Using AdamW optimizers with large-batch GPU vectorization (Sabour et al., 17 Jun 2025).
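The sketch below illustrates the batch construction and an autoguided teacher velocity. The guidance form $v_{\mathrm{weak}} + \lambda\,(v_{\mathrm{main}} - v_{\mathrm{weak}})$ with $\lambda \sim \mathrm{Unif}[1, 3]$, and the toy teacher networks, are assumptions made for exposition.

```python
# Batch construction and autoguided teacher velocity for one training step.
# The guidance form v_weak + lam * (v_main - v_weak) with lam ~ Unif[1, 3] and
# the toy teacher networks are assumptions, not the cited implementation.
import torch
import torch.nn as nn

d = 2
v_main = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))  # strong teacher (stub)
v_weak = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))  # weak teacher (stub)

def v_guided(x, t, lam):
    h = torch.cat([x, t[:, None]], dim=-1)
    return v_weak(h) + lam[:, None] * (v_main(h) - v_weak(h))

x0 = torch.randn(256, d)                    # data batch (placeholder)
x1 = torch.randn(256, d)                    # noise batch
t, s = torch.rand(256), torch.rand(256)     # time pairs in [0, 1]
lam = 1.0 + 2.0 * torch.rand(256)           # lambda ~ Unif[1, 3]
x_t = (1 - t)[:, None] * x0 + t[:, None] * x1

# The epsilon-discretized LMD loss from the previous sketch is then evaluated on
# (x_t, t, s) with v_guided as the teacher velocity, and the student parameters
# are updated with AdamW.
```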

4. Computational Advantages and Inference Procedures

LMD achieves fast, single-forward-pass inference of deterministic transport maps or generative flows without requiring inner optimization or ODE solves at test time:

  • For OT (static map): $T(x) \approx y_\zeta^*(x)$ via an MLP forward pass, with paths given by $\varphi_\eta^*(x, y_\zeta^*(x))$.
  • For flow distillation: $f_\theta(x_t, t, s)$ yields $x_s$ in one evaluation for arbitrary $t, s$.

Amortized neural predictors eliminate expensive iterative procedures at inference. In OT, spline basis evaluation yields fast geodesic reconstruction. In flow models, batch inference is supported by compact UNet/MLP architectures (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).
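As an illustration, the sketch below performs few-step generation by evaluating a trained flow map on a coarse time grid; this deterministic multi-step schedule is a plausible sampler under the stated flow-map semantics, not necessarily the exact procedure of the cited works.

```python
# Few-step sampling with a trained flow map: partition [1, 0] into a few
# sub-intervals and apply f_theta once per sub-interval. This deterministic
# schedule is a plausible sampler, not necessarily the cited procedure.
import torch

@torch.no_grad()
def sample(f_theta, shape, nfe=4):
    x = torch.randn(shape)                         # x_1 ~ noise at t = 1
    times = torch.linspace(1.0, 0.0, nfe + 1)      # 1 = t_0 > t_1 > ... > t_nfe = 0
    for t, s in zip(times[:-1], times[1:]):
        tb = torch.full((shape[0],), float(t))     # broadcast times over the batch
        sb = torch.full((shape[0],), float(s))
        x = f_theta(x, tb, sb)                     # one flow-map jump from t to s
    return x

# usage with a FlowMap-style model as sketched earlier (hypothetical):
# samples = sample(f, (64, 2), nfe=2)              # two network evaluations
```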

The following table summarizes key computational elements:

| Setting | Main predictors | Runtime inference |
| --- | --- | --- |
| OT (Pooladian et al., 2024) | $y_\zeta(x)$, $\varphi_\eta(x, y)$ | One MLP pass + spline evaluation |
| AYF (Sabour et al., 17 Jun 2025) | $f_\theta(x_t, t, s)$ | One UNet/MLP pass |

This approach is especially advantageous in high-throughput or large-scale applications, as demonstrated in low-dimensional OT examples and in state-of-the-art generative modeling on ImageNet (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).

5. Architectural, Regularization, and Training Considerations

The neural components are designed for efficiency:

  • MLP parameterizations for $g_\theta$ and $y_\zeta$ use four layers of 64 leaky-ReLU units; spline coefficients are predicted by a $2 \times 1024$ MLP.
  • Backpropagation through the spline evaluation is employed for updating $\eta$.
  • The three losses are alternated for joint optimization: maximizing the dual, regressing to amortized $c$-transforms, and minimizing path energies.
  • Warm-started L-BFGS (with backtracking line search) is used during training, not inference, to stabilize dual optimization.

In AYF, tangent warm-up, autoguidance (convex combinations of strong/weak teachers), and time embedding stabilization are critical for stable training and state-of-the-art sample quality. Adversarial fine-tuning with a StyleGAN2-based discriminator further improves sample FID without significantly harming recall (Sabour et al., 17 Jun 2025).

6. Theoretical Guarantees and Empirical Performance

Theoretical analysis in the generative context establishes that flow maps trained through LMD remain stable as the number of function evaluations (NFE) varies, in contrast to standard consistency models, whose error compounds as NFE increases. In OT, the LMD approach guarantees recovery of action-minimizing maps under standard convexity and smoothness assumptions on the Lagrangian.

Empirical benchmarks show:

  • OT with general Lagrangians: Efficient geodesic computation and map extraction even for non-Euclidean or constrained systems, outperforming classical approaches in scalability and flexibility (Pooladian et al., 2024).
  • ImageNet 64×64, 512×512 (AYF): 2–4 step FID scores set new state-of-the-art for few-step image generation. For example, AYF (no GAN) achieves FID 1.25 at 2 steps, 1.15 at 4 steps; adversarial tuning further reduces FID, with only a minor decrease in recall (Sabour et al., 17 Jun 2025).
  • Text-to-image: an AYF variant was preferred by ≈60% of participants over LoRA-CM baselines in a 47-person user study.

A practical implication is that LMD-based methods provide fast, scalable, geometry- or constraint-aware mappings and generations, efficiently amortizing complex computations while maintaining generation quality even with very few neural evaluations.

7. Practical Guidelines and Recommendations

For OT tasks with general Lagrangian costs:

  • Use cubic spline path parameterizations and MLP amortization for both maps and paths.
  • Train with batched dual, c-transform, and path energy losses, alternating updates.

For generative flows:

  • AYF recommends Eulerian Map Distillation (EMD) losses over LMD losses for stability.
  • Adopt teacher autoguidance with $\lambda \sim \mathrm{Unif}[1, 3]$.
  • Use the $f_\theta(x, t, s) = x + (s - t)\, F_\theta(x, t, s)$ parameterization, with time embeddings via an MLP.
  • Apply tangent warm-up (linearity ramping) and tangent normalization.
  • Adversarial fine-tuning is optional but effective for maximizing sample quality at very low NFE.

These procedures are supported by publicly released implementations for reproducibility (Pooladian et al., 2024). The frameworks are equally applicable to low-dimensional OT (with general geometric constraints) as well as high-dimensional generative modeling benchmarks (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).
