Lagrangian Map Distillation
- Lagrangian Map Distillation is a computational framework for learning deterministic mappings between probability measures using least-action principles.
- It employs neural and spline amortization to bypass costly ODE solvers, enabling fast, one-shot inference in optimal transport and generative modeling.
- The framework integrates dual, c-transform, and path-energy losses to achieve scalable, geometry-aware transport with state-of-the-art empirical performance.
Lagrangian Map Distillation (LMD) refers to a set of computational frameworks for learning deterministic, one-shot mappings and associated paths between probability measures or samples, subject to least-action or Lagrangian transport costs. LMD principles unlock efficient inference in optimal transport (OT) and generative flow models, notably by amortizing path-solving and optimization steps that would traditionally require expensive ODE solvers or iterative procedures at test time. Building on advances in neural parameterizations and amortized optimization, LMD methods are applicable to a variety of transport settings, including those with complex dynamics or geometric constraints, and furnish state-of-the-art generative models with reduced inference costs (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).
1. Lagrangian Costs and Least-Action Formulations
Lagrangian costs generalize classical OT by incorporating the geometry, dynamics, and constraints of the underlying system via an action functional. For points $x, y \in \mathbb{R}^d$, the Lagrangian cost is
$$c(x, y) = \inf_{\{x_t\}:\, x_0 = x,\; x_1 = y} \int_0^1 L(x_t, \dot{x}_t)\, dt,$$
where the Lagrangian $L$ is typically chosen as one of:
- Kinetic cost (“Benamou–Brenier”): $L(x_t, \dot{x}_t) = \tfrac{1}{2}\|\dot{x}_t\|^2$.
- Kinetic + potential (obstacles/barriers): $L(x_t, \dot{x}_t) = \tfrac{1}{2}\|\dot{x}_t\|^2 + U(x_t)$, with $U$ penalizing undesirable regions.
- Position-dependent Riemannian metrics: $L(x_t, \dot{x}_t) = \tfrac{1}{2}\,\dot{x}_t^\top A(x_t)\,\dot{x}_t$ for a positive-definite field $A(x)$, encoding non-Euclidean geometry.
The associated action functional admits a unique minimizer under mild regularity conditions, simultaneously yielding the cost $c(x, y)$ and the optimal (geodesic) path $(x_t^*)_{t \in [0,1]}$ (Pooladian et al., 2024).
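To make the action concrete, the following minimal sketch (not code from the cited work) evaluates the discretized action of a piecewise-linear path under a kinetic-plus-potential Lagrangian; the obstacle penalty `potential` and the example endpoints are illustrative assumptions.

```python
import numpy as np

def potential(x):
    """Hypothetical obstacle penalty U(x): large near the origin."""
    return 5.0 * np.exp(-10.0 * np.sum(x**2, axis=-1))

def discretized_action(path):
    """Approximate  int_0^1 [0.5*||x_dot||^2 + U(x)] dt  for a path of shape (K+1, d)."""
    K = path.shape[0] - 1
    dt = 1.0 / K
    velocities = np.diff(path, axis=0) / dt          # finite-difference x_dot, shape (K, d)
    kinetic = 0.5 * np.sum(velocities**2, axis=-1)   # 0.5 * ||x_dot||^2 per segment
    midpoints = 0.5 * (path[:-1] + path[1:])         # evaluate U at segment midpoints
    return float(np.sum((kinetic + potential(midpoints)) * dt))

# Straight line vs. a detour around the obstacle, between x = (-1, 0) and y = (1, 0).
ts = np.linspace(0.0, 1.0, 65)[:, None]
straight = (1 - ts) * np.array([-1.0, 0.0]) + ts * np.array([1.0, 0.0])
detour = straight + np.stack([np.zeros(65), 0.8 * np.sin(np.pi * ts[:, 0])], axis=1)
print(discretized_action(straight), discretized_action(detour))
```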
2. LMD for Optimal Transport: Neural and Spline Amortization
The LMD recipe for OT underlies recent “neural optimal transport with Lagrangian costs” approaches. The goal is to distill in a single training stage:
- A Kantorovich potential $f_\theta$, parameterized by an MLP, for dual optimization.
- An amortized map predictor $\hat{x}_\phi$, also an MLP, to approximate the $c$-transform minimizer $x^*(y) = \arg\min_x \{\, c(x, y) - f_\theta(x) \,\}$.
- A spline path predictor $c_\psi$, amortizing the coefficients for cubic-spline representations of geodesic paths (see the sketch after this list).
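A minimal sketch of these three components, assuming small PyTorch MLPs and a piecewise-linear stand-in for the cubic-spline basis (layer sizes, `n_knots`, and the helper names are illustrative assumptions, not the cited architecture):

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, width=64, depth=4):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.LeakyReLU()]
        d = width
    return nn.Sequential(*layers, nn.Linear(d, out_dim))

dim, n_knots = 2, 8                      # ambient dimension; interior control points per path
f_theta   = mlp(dim, 1)                  # Kantorovich potential f_theta(x)
x_hat_phi = mlp(dim, dim)                # amortized c-transform minimizer x_hat_phi(y)
c_psi     = mlp(2 * dim, n_knots * dim)  # path coefficients conditioned on the pair (x, y)

def amortized_path(x, y, t):
    """Evaluate the amortized path at scalar time t in [0, 1]; endpoints are pinned to (x, y),
    interior control points come from c_psi (piecewise-linear interpolation is used here as a
    stand-in for a cubic-spline basis)."""
    knots = c_psi(torch.cat([x, y], dim=-1)).view(-1, n_knots, dim)
    control = torch.cat([x.unsqueeze(1), knots, y.unsqueeze(1)], dim=1)  # (B, n_knots + 2, d)
    u = t * (control.shape[1] - 1)                  # fractional index into the control points
    i = min(int(u), control.shape[1] - 2)
    w = u - i
    return (1 - w) * control[:, i] + w * control[:, i + 1]

x, y = torch.randn(16, dim), torch.randn(16, dim)
midpoint = amortized_path(x, y, 0.5)                # one forward pass through c_psi
```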
Training alternates three batch Monte Carlo losses:
- Dual objective ($\mathcal{L}_{\text{dual}}$):
$$\mathcal{L}_{\text{dual}}(\theta) = \mathbb{E}_{x \sim \mu}\big[f_\theta(x)\big] + \mathbb{E}_{y \sim \nu}\big[f_\theta^{c}(y)\big],$$
where $f_\theta^{c}(y) = \inf_x \{\, c(x, y) - f_\theta(x) \,\}$, with the inner minimizer $x^*(y)$ obtained by warm-started L-BFGS initialized at $\hat{x}_\phi(y)$.
- $c$-transform amortization ($\mathcal{L}_{\text{amor}}$): $\mathcal{L}_{\text{amor}}(\phi) = \mathbb{E}_{y \sim \nu}\big\|\hat{x}_\phi(y) - x^*(y)\big\|^2$, regressing the map predictor onto the refined $c$-transform minimizers.
- Path-energy amortization ($\mathcal{L}_{\text{path}}$):
$$\mathcal{L}_{\text{path}}(\psi) = \mathbb{E}\left[\int_0^1 L\big(x_t^\psi, \dot{x}_t^\psi\big)\, dt\right],$$
with $x_0^\psi = x$, $x_1^\psi = y$, and the interior of the path $x_t^\psi$ given by the amortized spline with coefficients $c_\psi(x, y)$.
Optimization proceeds by stochastic, highly vectorized updates (typically via Adam) of $(\theta, \phi, \psi)$. L-BFGS is used only during training, to refine the amortized prediction $\hat{x}_\phi(y)$ for the dual objective, and is quickly amortized away (Pooladian et al., 2024). A schematic training step is sketched below.
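The following is a schematic training step under these losses, shown for the plain kinetic cost (for which $c(x,y) = \tfrac12\|x-y\|^2$ and the path-energy loss is omitted); the network sizes, the gradient-descent stand-in for warm-started L-BFGS, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim = 2
mlp = lambda i, o: nn.Sequential(nn.Linear(i, 64), nn.LeakyReLU(),
                                 nn.Linear(64, 64), nn.LeakyReLU(), nn.Linear(64, o))
f_theta, x_hat_phi = mlp(dim, 1), mlp(dim, dim)          # potential and amortized map predictor
opt_f = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
opt_x = torch.optim.Adam(x_hat_phi.parameters(), lr=1e-3)

def cost(x, y):                                          # kinetic (Benamou-Brenier) cost
    return 0.5 * ((x - y) ** 2).sum(-1)

def refine(y, x_init, steps=5, lr=0.1):
    """Stand-in for warm-started L-BFGS: a few gradient steps on the c-transform objective."""
    x = x_init.detach().requires_grad_(True)
    for _ in range(steps):
        obj = (cost(x, y) - f_theta(x).squeeze(-1)).sum()
        (g,) = torch.autograd.grad(obj, x)
        x = (x - lr * g).detach().requires_grad_(True)
    return x.detach()

for step in range(1000):
    x = torch.randn(256, dim)                            # batch from mu
    y = torch.randn(256, dim) + 2.0                      # batch from nu
    x_star = refine(y, x_hat_phi(y))                     # refined c-transform minimizers

    # Dual objective (maximized): E_mu[f(x)] + E_nu[ c(x*(y), y) - f(x*(y)) ]
    dual = f_theta(x).mean() + (cost(x_star, y) - f_theta(x_star).squeeze(-1)).mean()
    opt_f.zero_grad(); (-dual).backward(); opt_f.step()

    # c-transform amortization: regress the predictor onto the refined minimizers
    amor = ((x_hat_phi(y) - x_star) ** 2).sum(-1).mean()
    opt_x.zero_grad(); amor.backward(); opt_x.step()
    # For general Lagrangian costs, a third loss would minimize the amortized spline path energy.
```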
3. LMD in Continuous-Time Flow Map Distillation
In generative modeling, LMD appears within the “Align Your Flow” (AYF) framework as a variant of map distillation from score-based or flow-matching diffusion models. Continuous-time flow maps are neural networks $f_\theta(x_t, t, s)$ that predict the state $x_s$ at time $s$ conditioned on the current state $x_t$ and the time pair $(t, s)$, such that $f_\theta(x_t, t, t) = x_t$ and $f_\theta(x_t, t, s) \approx x_s$ along the teacher trajectory. The parametrization is
$$f_\theta(x_t, t, s) = x_t + (s - t)\, F_\theta(x_t, t, s),$$
where $F_\theta$ approximates the score or velocity field (Sabour et al., 17 Jun 2025).
The LMD loss in AYF enforces the Lagrangian condition that the predicted end point stays fixed as the input is transported along the teacher ODE,
$$\frac{d}{dt} f_\theta(x_t, t, s) = \partial_t f_\theta(x_t, t, s) + \nabla_{x_t} f_\theta(x_t, t, s)\, v(x_t, t) = 0,$$
with $v$ the teacher velocity field and $x_t$ propagated by the teacher ODE. In the continuous-time limit, this enforces end-point consistency with the velocity field, serving as a Lagrangian constraint (Sabour et al., 17 Jun 2025).
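A minimal sketch of this parameterization together with a finite-difference surrogate for the Lagrangian condition (the teacher velocity, network sizes, the step `dt`, and the choice of which side carries the stop-gradient are illustrative assumptions; the cited work formulates the objective in continuous time):

```python
import torch
import torch.nn as nn

class FlowMap(nn.Module):
    """f_theta(x_t, t, s) = x_t + (s - t) * F_theta(x_t, t, s); f_theta(x_t, t, t) = x_t by construction."""
    def __init__(self, dim, width=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, width), nn.SiLU(),
                                 nn.Linear(width, width), nn.SiLU(), nn.Linear(width, dim))

    def forward(self, x_t, t, s):
        return x_t + (s - t) * self.net(torch.cat([x_t, t, s], dim=-1))

def lmd_surrogate_loss(student, teacher_velocity, x_t, t, s, dt=1e-2):
    """Finite-difference surrogate for d/dt f_theta(x_t, t, s) = 0 along the teacher ODE:
    the prediction from x_t should match a stop-gradient prediction from the point reached
    by one small Euler step of the teacher."""
    with torch.no_grad():
        x_next = x_t + dt * teacher_velocity(x_t, t)   # move x_t along the teacher ODE
        target = student(x_next, t + dt, s)            # stop-gradient target
    pred = student(x_t, t, s)
    return ((pred - target) ** 2).sum(-1).mean()

dim = 8
student = FlowMap(dim)
teacher_velocity = lambda x, t: -x                     # hypothetical teacher velocity field
x_t, t, s = torch.randn(32, dim), torch.rand(32, 1), torch.rand(32, 1)
loss = lmd_surrogate_loss(student, teacher_velocity, x_t, t, s)
```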
The training loop involves:
- Sampling clean data and Gaussian noise, and forming noisy interpolants $x_t$ for times $t$ drawn from a training distribution on $[0, 1]$.
- Constructing autoguided teacher velocities as convex combinations of the main and weak teachers.
- Computing the student prediction and the target as above.
- Applying consistency and linearity regularization during training.
- Using AdamW optimizers with large-batch GPU vectorization (Sabour et al., 17 Jun 2025).
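A self-contained sketch of this loop (the data source, the interpolant convention, the autoguidance weight `lam`, and the weak/strong teacher stand-ins are all illustrative assumptions):

```python
import torch
import torch.nn as nn

dim = 8
student = nn.Sequential(nn.Linear(dim + 2, 256), nn.SiLU(), nn.Linear(256, dim))
flow_map = lambda x, t, s: x + (s - t) * student(torch.cat([x, t, s], dim=-1))
v_main = lambda x, t: -x                      # hypothetical strong teacher velocity
v_weak = lambda x, t: -0.8 * x                # hypothetical weak teacher velocity
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
lam, dt = 0.7, 1e-2                           # autoguidance weight in [0, 1]; illustrative step

for it in range(1000):
    x_data = torch.randn(64, dim)             # stand-in for a data batch
    noise = torch.randn(64, dim)
    t, s = torch.rand(64, 1), torch.rand(64, 1)
    x_t = (1 - t) * x_data + t * noise        # linear interpolant (one common convention)

    # Autoguided teacher velocity: convex combination of main and weak teachers.
    v = lambda x, tt: lam * v_main(x, tt) + (1 - lam) * v_weak(x, tt)

    with torch.no_grad():                     # stop-gradient target from a teacher Euler step
        target = flow_map(x_t + dt * v(x_t, t), t + dt, s)
    loss = ((flow_map(x_t, t, s) - target) ** 2).sum(-1).mean()

    opt.zero_grad(); loss.backward(); opt.step()
```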
4. Computational Advantages and Inference Procedures
LMD achieves fast, single-forward-pass inference of deterministic transport maps or generative flows without requiring inner optimization or ODE solves at test time:
- For OT (static map): the transport map is evaluated with a single forward pass of the amortized MLP predictor $\hat{x}_\phi$; geodesic paths are reconstructed from the amortized spline coefficients $c_\psi$.
- For flow distillation: $f_\theta(x_t, t, s)$ yields $x_s$ in one network evaluation for arbitrary time pairs $(t, s)$.
Amortized neural predictors eliminate expensive iterative procedures at inference. In OT, spline basis evaluation yields fast geodesic reconstruction. In flow models, batch inference is supported by compact UNet/MLP architectures (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).
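For instance, few-step sampling with a trained flow map reduces to chaining forward passes over a coarse time grid (the time convention, with noise at $t = 1$ and data at $t = 0$, and the function interface are assumptions):

```python
import torch

@torch.no_grad()
def few_step_sample(flow_map, x_init, times=(1.0, 0.5, 0.0)):
    """Deterministic few-step sampling: each (t, s) pair costs one network evaluation,
    e.g. times = (1.0, 0.5, 0.0) corresponds to NFE = 2."""
    x = x_init
    for t, s in zip(times[:-1], times[1:]):
        t_b = torch.full((x.shape[0], 1), t)
        s_b = torch.full((x.shape[0], 1), s)
        x = flow_map(x, t_b, s_b)          # one forward pass per step
    return x
```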
The following table summarizes key computational elements:
| Setting | Main Predictors | Runtime Inference |
|---|---|---|
| OT (Pooladian et al., 2024) | Potential $f_\theta$, map $\hat{x}_\phi$, spline coefficients $c_\psi$ | One MLP pass + spline eval |
| AYF (Sabour et al., 17 Jun 2025) | Flow map $f_\theta(x_t, t, s)$ | One UNet/MLP pass |
This approach is especially advantageous in high-throughput or large-scale applications, as demonstrated in low-dimensional OT examples and in state-of-the-art generative modeling on ImageNet (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).
5. Architectural, Regularization, and Training Considerations
The neural components are designed for efficiency:
- MLP parameterizations for the potential $f_\theta$ and the map predictor $\hat{x}_\phi$ use four layers of 64 leaky-ReLU units; the spline coefficients are produced by a separate MLP $c_\psi$.
- Backpropagation through the spline construction is employed for updating $\psi$.
- Loss alternation is used for joint optimization: maximizing the dual objective, regressing the amortized $c$-transform predictor, and minimizing path energies.
- Warm-started L-BFGS (with backtracking line search) is used during training, not inference, to stabilize dual optimization.
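As an illustration of this warm-started refinement, shown for the plain kinetic cost where $c(x,y) = \tfrac12\|x-y\|^2$ (the potential below is a toy stand-in for the learned $f_\theta$, and the general Lagrangian case would evaluate $c$ through the spline path solver):

```python
import numpy as np
from scipy.optimize import minimize

def f_potential(x):
    """Toy stand-in for the learned Kantorovich potential f_theta."""
    return np.sin(x).sum()

def f_potential_grad(x):
    return np.cos(x)

def refine_c_transform(y, x_warm_start):
    """Warm-started L-BFGS solve of  min_x  c(x, y) - f(x)  with c(x, y) = 0.5*||x - y||^2."""
    def obj_and_grad(x):
        diff = x - y
        val = 0.5 * np.dot(diff, diff) - f_potential(x)
        grad = diff - f_potential_grad(x)
        return val, grad
    res = minimize(obj_and_grad, x0=x_warm_start, jac=True, method="L-BFGS-B")
    return res.x

y = np.array([1.0, -2.0])
x_amortized = y.copy()                 # stand-in for the amortized prediction x_hat_phi(y)
x_star = refine_c_transform(y, x_amortized)
```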
In AYF, tangent warm-up, autoguidance (convex combinations of strong/weak teachers), and time embedding stabilization are critical for stable training and state-of-the-art sample quality. Adversarial fine-tuning with a StyleGAN2-based discriminator further improves sample FID without significantly harming recall (Sabour et al., 17 Jun 2025).
6. Theoretical Guarantees and Empirical Performance
Theoretical analysis in the generative context establishes that flow maps trained via LMD-style objectives remain stable as the number of function evaluations (NFE) varies, in contrast to standard consistency models, whose error compounds as NFE increases. In OT, the LMD approach guarantees recovery of action-minimizing maps under standard convexity and smoothness assumptions on the Lagrangian.
Empirical benchmarks show:
- OT with general Lagrangians: Efficient geodesic computation and map extraction even for non-Euclidean or constrained systems, outperforming classical approaches in scalability and flexibility (Pooladian et al., 2024).
- ImageNet 64×64, 512×512 (AYF): 2–4 step FID scores set new state-of-the-art for few-step image generation. For example, AYF (no GAN) achieves FID 1.25 at 2 steps, 1.15 at 4 steps; adversarial tuning further reduces FID, with only a minor decrease in recall (Sabour et al., 17 Jun 2025).
- Text-to-image: the AYF variant showed a user preference of ≈60% over LoRA-CM baselines in a 47-person study.
A practical implication is that LMD-based methods provide fast, scalable, geometry- or constraint-aware mappings and generations, efficiently amortizing complex computations while maintaining generation quality even with very few neural evaluations.
7. Practical Guidelines and Recommendations
For OT tasks with general Lagrangian costs:
- Use cubic spline path parameterizations and MLP amortization for both maps and paths.
- Train with batched dual, c-transform, and path energy losses, alternating updates.
For generative flows:
- AYF recommends Eulerian Map Distillation (EMD) for stability over LMD losses.
- Adopt teacher autoguidance, forming the guiding velocity as a convex combination of the strong and weak teachers.
- Use the $f_\theta(x_t, t, s) = x_t + (s - t)\,F_\theta(x_t, t, s)$ parameterization, with the times $t$ and $s$ embedded via MLP time embeddings (see the sketch after this list).
- Apply tangent warm-up (linearity ramping) and tangent normalization.
- Adversarial fine-tuning is optional but effective for maximizing sample quality at very low NFE.
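A compact sketch of that parameterization with MLP time embeddings, plus a linear warm-up ramp of the kind used for gradually enabling a tangent term (embedding sizes, widths, and the schedule length are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AYFStyleFlowMap(nn.Module):
    """x_t + (s - t) * F_theta(x_t, emb(t), emb(s)), with small MLP time embeddings."""
    def __init__(self, dim, width=256, emb_dim=64):
        super().__init__()
        self.t_emb = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.s_emb = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.body = nn.Sequential(nn.Linear(dim + 2 * emb_dim, width), nn.SiLU(),
                                  nn.Linear(width, width), nn.SiLU(), nn.Linear(width, dim))

    def forward(self, x_t, t, s):
        h = torch.cat([x_t, self.t_emb(t), self.s_emb(s)], dim=-1)
        return x_t + (s - t) * self.body(h)

def warmup_coeff(step, warmup_steps=10_000):
    """Linear ramp in [0, 1], e.g. for gradually enabling a tangent/linearity term."""
    return min(1.0, step / warmup_steps)
```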
These procedures are supported by publicly released implementations for reproducibility (Pooladian et al., 2024). The frameworks apply both to low-dimensional OT with general geometric constraints and to high-dimensional generative modeling benchmarks (Pooladian et al., 2024, Sabour et al., 17 Jun 2025).