Lagrangian Self-Distillation (LSD)
- The paper introduces LSD, which leverages analytic ODE signals to self-distill flow maps, eliminating the need for pre-trained teacher networks.
- It parameterizes flow maps using a first-order Taylor expansion and enforces both diagonal and off-diagonal Lagrangian constraints to guide learning.
- The approach guarantees convergence in 2-Wasserstein distance and highlights trade-offs between derivative-based and derivative-free methods in different problem dimensions.
Lagrangian Self-Distillation (LSD) is a direct training paradigm for learning flow maps in consistency models, which are generative models defined by solutions to probability flow ordinary differential equations (ODEs). LSD leverages the structure of these flows to construct self-distillation objectives that do not require a pre-trained teacher network. Instead, the framework bootstraps the training signal from analytic properties of the flow, enforcing both the diagonal and off-diagonal constraints implied by the underlying ODE governing sample transformation from a base to a target distribution (Boffi et al., 24 May 2025).
1. Probability Flow ODEs and Flow Maps
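LSD operates in the context of continuous-time probability flows that transport an initial base distribution $\rho_0$ (often Gaussian) to a target distribution $\rho_1$. The model is defined by a probability flow ODE in $\mathbb{R}^d$:
$$\dot{x}_t = b_t(x_t), \qquad x_0 \sim \rho_0,$$
where $b_t$ is the velocity field at time $t$. This can be constructed using a stochastic interpolant,
$$I_t = \alpha_t x_0 + \beta_t x_1, \qquad x_0 \sim \rho_0,\ x_1 \sim \rho_1,$$
with coefficients satisfying $\alpha_0 = 1$, $\alpha_1 = 0$, $\beta_0 = 0$, and $\beta_1 = 1$. The time-derivative is
$$\dot{I}_t = \dot{\alpha}_t x_0 + \dot{\beta}_t x_1,$$
leading to a velocity field given by conditional expectation,
$$b_t(x) = \mathbb{E}\big[\dot{I}_t \mid I_t = x\big].$$
The two-time flow map $X_{s,t}$ is defined by ODE evolution from time $s$ to time $t$ with $0 \le s \le t \le 1$:
$$X_{s,t}(x_s) = x_t,$$
where $(x_t)$ satisfies the ODE above, so that $X_{s,s}(x) = x$. Single-step mapping from $\rho_0$ to $\rho_1$ is accomplished by computing $X_{0,1}(x_0)$ for $x_0 \sim \rho_0$.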
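As a concrete illustration of the difference between integrating the probability flow ODE and evaluating a flow map once, the following minimal sketch uses a one-dimensional Gaussian toy problem in which the velocity field and flow map are available in closed form. The toy setup (target mean `M`, target standard deviation `SIGMA`, independent coupling, linear interpolant with $\alpha_t = 1 - t$, $\beta_t = t$) and all function names are assumptions made for illustration and do not come from the source.

```python
import math

# Hypothetical toy setup (not from the source): base N(0, 1), target N(M, SIGMA^2),
# independent coupling, linear interpolant alpha_t = 1 - t, beta_t = t.
M, SIGMA = 2.0, 0.5

def std(t):
    """Standard deviation of I_t = (1 - t) * x0 + t * x1."""
    return math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)

def dstd(t):
    """Time derivative of std(t)."""
    return (t * SIGMA ** 2 - (1.0 - t)) / std(t)

def velocity(t, x):
    """b_t(x) = E[dI_t/dt | I_t = x], closed form for jointly Gaussian variables."""
    return M + dstd(t) / std(t) * (x - t * M)

def flow_map(s, t, x):
    """X_{s,t}(x): exact solution of dx/dt = b_t(x) started from x at time s."""
    return t * M + std(t) / std(s) * (x - s * M)

def euler(s, t, x, n_steps=20000):
    """Reference integrator for the probability flow ODE (many small steps)."""
    h = (t - s) / n_steps
    for k in range(n_steps):
        x += h * velocity(s + k * h, x)
    return x

x0 = 0.7
one_step = flow_map(0.0, 1.0, x0)   # single evaluation of the flow map
many_steps = euler(0.0, 1.0, x0)    # numerical ODE solve, for comparison
print(one_step, many_steps)
```

The two printed values should agree up to the discretization error of the Euler scheme. In practice the flow map is not known in closed form, which is why it is learned.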
2. Lagrangian Tangency and Flow Map Parameterization
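The flow map satisfies the so-called Lagrangian (tangent) relation, a differential equation in $t$ for each fixed $s$ and $x$:
$$\partial_t X_{s,t}(x) = b_t\big(X_{s,t}(x)\big), \qquad X_{s,s}(x) = x.$$
As $s \to t$, the rate of change of the flow map recovers the instantaneous velocity field:
$$\lim_{s \to t} \partial_t X_{s,t}(x) = b_t(x).$$
A practical parameterization of the flow map is given by a first-order Taylor expansion:
$$X_{s,t}(x) = x + (t - s)\, v_{s,t}(x),$$
which implies the diagonal constraint $v_{t,t}(x) = b_t(x)$.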
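The step from the Taylor-form parameterization to the diagonal constraint can be spelled out as a short derivation. Here $X_{s,t}(x) = x + (t-s)\,v_{s,t}(x)$ denotes the flow map, $b_t$ the velocity field, and $v_{s,t}$ is assumed to be differentiable in $t$:

```latex
\begin{aligned}
\partial_t X_{s,t}(x)
  &= \partial_t\big[x + (t-s)\,v_{s,t}(x)\big]
   = v_{s,t}(x) + (t-s)\,\partial_t v_{s,t}(x), \\
\partial_t X_{s,t}(x)\Big|_{s=t}
  &= v_{t,t}(x), \\
b_t\big(X_{s,t}(x)\big)\Big|_{s=t}
  &= b_t(x) \quad \text{since } X_{t,t}(x) = x, \\
\Longrightarrow \quad v_{t,t}(x) &= b_t(x).
\end{aligned}
```

The Taylor form also satisfies the boundary condition $X_{s,s}(x) = x$ automatically, so only the tangent relation remains to be enforced by training.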
3. Self-Distillation Loss Construction
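LSD trains a single neural network $\hat v_{s,t}(x)$, which defines the learned flow map $\hat X_{s,t}(x) = x + (t - s)\, \hat v_{s,t}(x)$, so that it satisfies two properties:
- On the diagonal $s = t$, $\hat v_{t,t}$ matches $b_t$ (the velocity field).
- Off the diagonal $s \neq t$, the flow map parameterization $\hat X_{s,t}$ satisfies the Lagrangian PDE.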
The training objective has two components:
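- Diagonal (flow matching) loss: $$L_b(\hat v) = \int_0^1 \mathbb{E}\Big[\big|\hat v_{t,t}(I_t) - \dot{I}_t\big|^2\Big]\, dt,$$ where $I_t$ is the stochastic interpolant and $\dot{I}_t$ its time derivative. This loss is uniquely minimized when $\hat v_{t,t} = b_t$.
- Off-diagonal Lagrangian self-distillation loss: $$L_{\mathrm{LSD}}(\hat v) = \int_0^1 \int_0^t \mathbb{E}\Big[\big|\partial_t \hat X_{s,t}(I_s) - \hat v_{t,t}\big(\hat X_{s,t}(I_s)\big)\big|^2\Big]\, ds\, dt,$$ with $\hat X_{s,t}(x) = x + (t - s)\, \hat v_{s,t}(x)$ and time derivatives computed via automatic differentiation. At the global minimum, $\partial_t \hat X_{s,t}(x) = \hat v_{t,t}\big(\hat X_{s,t}(x)\big)$ holds.
- Combined self-distillation loss: $$L_{\mathrm{SD}}(\hat v) = L_b(\hat v) + L_{\mathrm{LSD}}(\hat v).$$ Minimization of $L_{\mathrm{SD}}$ enforces the tangent (Lagrangian) flow condition everywhere. No external teacher model is used; the analytic diagonal signal serves as the regression target.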
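The two loss terms can be sketched as Monte Carlo estimates on a one-dimensional Gaussian toy problem where the exact map is known. Everything specific here is an assumption for illustration rather than the source's implementation: the toy target (`M`, `SIGMA`), the linear interpolant, the helper names, the perturbed candidate model, and the use of a central finite difference in place of the automatic-differentiation time derivative (in an autodiff framework one would typically use a forward-mode Jacobian-vector product instead).

```python
import math
import random

# Hypothetical toy setup (not from the source): base N(0, 1), target N(M, SIGMA^2),
# independent coupling, linear interpolant I_t = (1 - t) * x0 + t * x1.
M, SIGMA = 2.0, 0.5

def std(t):
    return math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)

def exact_v(s, t, x):
    """v_{s,t}(x) = (X_{s,t}(x) - x) / (t - s) for the exact flow map of this toy."""
    if abs(t - s) < 1e-9:  # diagonal: v_{t,t} = b_t
        dstd = (t * SIGMA ** 2 - (1.0 - t)) / std(t)
        return M + dstd / std(t) * (x - t * M)
    x_t = t * M + std(t) / std(s) * (x - s * M)
    return (x_t - x) / (t - s)

def make_candidate(eps):
    """Stand-in for an imperfect network: exact v plus a perturbation of size eps."""
    return lambda s, t, x: exact_v(s, t, x) + eps * math.cos(x)

def losses(v, n_samples=4000, h=1e-4, seed=0):
    """Monte Carlo estimates of the diagonal loss and the off-diagonal LSD loss."""
    rng = random.Random(seed)

    def flow(s, t, x):  # X_{s,t}(x) = x + (t - s) * v_{s,t}(x)
        return x + (t - s) * v(s, t, x)

    l_b = l_lsd = 0.0
    for _ in range(n_samples):
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(M, SIGMA)
        s, t = sorted((rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)))  # s <= t
        i_s = (1.0 - s) * x0 + s * x1
        i_t = (1.0 - t) * x0 + t * x1
        # Diagonal term: regress v_{t,t}(I_t) onto dI_t/dt = x1 - x0.
        l_b += (v(t, t, i_t) - (x1 - x0)) ** 2
        # Off-diagonal term: d/dt X_{s,t}(I_s) should equal v_{t,t}(X_{s,t}(I_s)).
        # A central finite difference stands in for the autodiff time derivative.
        dt_flow = (flow(s, t + h, i_s) - flow(s, t - h, i_s)) / (2.0 * h)
        l_lsd += (dt_flow - v(t, t, flow(s, t, i_s))) ** 2
    return l_b / n_samples, l_lsd / n_samples

lb_exact, lsd_exact = losses(exact_v)
lb_pert, lsd_pert = losses(make_candidate(0.5))
print("exact model:     L_b = %.4f, L_LSD = %.2e" % (lb_exact, lsd_exact))
print("perturbed model: L_b = %.4f, L_LSD = %.2e" % (lb_pert, lsd_pert))
```

For the exact map the off-diagonal residual vanishes up to finite-difference error, while the diagonal loss stays strictly positive because it regresses onto the noisy target $\dot I_t$ rather than onto $b_t$ itself. The perturbed candidate increases both terms.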
4. Practical Training Details and Algorithmic Strategies
Several important practical strategies characterize efficient LSD training:
- Teacher warmup: Initially train only the diagonal (flow matching) loss for several thousand steps to guide the diagonal output $\hat v_{t,t}$ toward the velocity field $b_t$ before introducing off-diagonal terms.
- Gradual incorporation of off-diagonal terms: The maximum allowed gap $|t - s|$ is linearly annealed, smoothly ramping in the Lagrangian self-distillation loss over the off-diagonals.
- Sampling over time indices: The time pairs $(s, t)$ are sampled uniformly or with a learned weight over the simplex $\{(s, t) : 0 \le s \le t \le 1\}$.
- Efficient differentiation: Autodifferentiation primitives (e.g., jvp/vjp in autodiff frameworks) are employed to compute the time derivatives $\partial_t \hat X_{s,t}$ or $\partial_s \hat X_{s,t}$ and spatial gradients efficiently.
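The warmup and annealing strategies above can be sketched as a small schedule that controls which time pairs are drawn at each training step. The step counts, function names, and the particular sampling scheme (draw $t$ uniformly, then pull $s$ back by a random fraction of the allowed gap) are illustrative assumptions, not values or procedures taken from the source.

```python
import random

def max_gap(step, warmup_steps=5000, anneal_steps=20000):
    """Largest allowed t - s at a given training step.

    Zero during the diagonal-only warmup, then a linear ramp up to 1.
    The default step counts are placeholders, not values from the source.
    """
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / anneal_steps)

def sample_time_pair(step, rng):
    """Draw (s, t) with 0 <= s <= t <= 1 and t - s <= max_gap(step)."""
    gap = max_gap(step)
    t = rng.uniform(0.0, 1.0)
    s = max(0.0, t - gap * rng.uniform(0.0, 1.0))
    return s, t

def off_diagonal_weight(step):
    """Weight on the off-diagonal loss: zero while the allowed gap is zero."""
    return 0.0 if max_gap(step) == 0.0 else 1.0

rng = random.Random(0)
for step in (0, 6000, 15000, 40000):
    s, t = sample_time_pair(step, rng)
    print(step, max_gap(step), off_diagonal_weight(step), round(t - s, 3))
```

During warmup every sampled pair lies on the diagonal, so only the flow matching term contributes. Once the gap is fully annealed, pairs cover the whole simplex, although this simple scheme is not exactly uniform over it.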
5. Comparative Variants: ESD and PSD
The LSD loss as formulated requires computation of time-derivatives of the network with respect to $t$. The framework encompasses alternative objectives:
- Eulerian Self-Distillation (ESD): Enforces the flow PDE in Eulerian coordinates, introducing a spatial derivative term: $$\partial_s X_{s,t}(x) + \nabla X_{s,t}(x)\, b_s(x) = 0,$$ which necessitates spatial gradients $\nabla \hat X_{s,t}$ of the learned map.
- Progressive Self-Distillation (PSD): Enforces the semigroup property $X_{u,t}\big(X_{s,u}(x)\big) = X_{s,t}(x)$ for an intermediate time $u$ between $s$ and $t$, using a three-point loss that avoids both time and spatial derivatives.
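To make the difference between the two alternative objectives concrete, the sketch below evaluates an Eulerian residual and a semigroup residual for the exact flow map of a one-dimensional Gaussian toy problem. The toy setup, the function names, and the finite-difference derivatives are assumptions for illustration; the point is only that the Eulerian residual needs derivatives in both $s$ and $x$, while the semigroup residual needs three map evaluations and no derivatives.

```python
import math

M, SIGMA = 2.0, 0.5  # hypothetical toy target N(M, SIGMA^2); the base is N(0, 1)

def std(t):
    return math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)

def velocity(t, x):
    """Closed-form b_t(x) for the linear interpolant in this toy."""
    dstd = (t * SIGMA ** 2 - (1.0 - t)) / std(t)
    return M + dstd / std(t) * (x - t * M)

def flow_map(s, t, x):
    """Closed-form X_{s,t}(x) for this toy."""
    return t * M + std(t) / std(s) * (x - s * M)

def eulerian_residual(X, b, s, t, x, h=1e-5):
    """d/ds X_{s,t}(x) + d/dx X_{s,t}(x) * b_s(x), via central differences (1D)."""
    ds = (X(s + h, t, x) - X(s - h, t, x)) / (2.0 * h)
    dx = (X(s, t, x + h) - X(s, t, x - h)) / (2.0 * h)
    return ds + dx * b(s, x)

def semigroup_residual(X, s, u, t, x):
    """X_{u,t}(X_{s,u}(x)) - X_{s,t}(x): three time points, no derivatives."""
    return X(u, t, X(s, u, x)) - X(s, t, x)

s, u, t, x = 0.2, 0.5, 0.9, -0.3
esd = eulerian_residual(flow_map, velocity, s, t, x)
psd = semigroup_residual(flow_map, s, u, t, x)
print(esd, psd)
```

Both residuals are numerically close to zero for the exact map. For a learned map, squaring and averaging them over sampled $(s, u, t, x)$ gives objectives in the spirit of ESD and PSD respectively.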
Empirical results indicate that, on high-dimensional tasks such as image synthesis, both LSD and ESD suffer from high-variance gradients and training instability due to derivative computations, whereas PSD's derivative-free objective yields more stable training and improved FID scores for single- or two-step sampling. On low-dimensional problems, LSD's derivative-based objective captures sharp, non-linear features more accurately, as the off-diagonal PDE residual enables learning of steep boundaries in multimodal densities.
6. Theoretical Guarantees and Limitations
The LSD methodology comes with guarantees on generative distribution matching. As the combined LSD loss tends to zero, the 2-Wasserstein distance $W_2(\hat\rho_1, \rho_1)$ between the learned law $\hat\rho_1$ and the target distribution $\rho_1$ converges to zero, at a rate governed by the loss value. That is, minimizing the combined LSD loss provably yields accurate learned flow maps in the sense of optimal transport, bridging the base and target laws via the consistency model approach.
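The quantity being controlled can be illustrated with the closed-form 2-Wasserstein distance between one-dimensional Gaussians, $W_2^2 = (m_1 - m_2)^2 + (\sigma_1 - \sigma_2)^2$. The sketch below assumes a hypothetical toy target $N(M, \mathrm{SIGMA}^2)$ with a standard normal base, and hypothetical affine one-step maps with a shared error `e` in offset and slope. It does not reproduce the paper's bound or its rate; it only shows the distance shrinking as the learned one-step map approaches the exact one.

```python
import math

def w2_gaussian(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) in 1D."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# Hypothetical toy: base N(0, 1), target N(M, SIGMA^2). The exact one-step map is
# x -> M + SIGMA * x; an affine map x -> a + c * x (c > 0) pushes N(0, 1) to N(a, c^2).
M, SIGMA = 2.0, 0.5

def w2_of_affine_map(a, c):
    """W2 between the law produced by x -> a + c * x and the toy target."""
    return w2_gaussian(a, c, M, SIGMA)

errors = [0.4, 0.2, 0.1, 0.0]
distances = [w2_of_affine_map(M + e, SIGMA + e) for e in errors]
for e, d in zip(errors, distances):
    print(e, d)
```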
7. Context, Misconceptions, and Related Work
LSD addresses a central limitation of prior distillation schemes for consistency models, namely the dependency on pre-trained teacher networks and multi-stage distillation. By exploiting the analytic diagonal signal and the Lagrangian structure of the flow map, training is converted to a self-distillation procedure.
A frequent misconception is that off-diagonal self-distillation necessarily improves performance in high dimensions. Empirical observations indicate that, in practice, derivative-based objectives such as LSD and ESD can introduce instability and high-variance gradients, especially for image synthesis, whereas PSD provides superior stability and generative metrics in these settings (Boffi et al., 24 May 2025). Conversely, derivative-based LSD is advantageous in low-dimensional problems with sharp boundaries.
The systematic framework outlined in Boffi et al. (2024) and extended in the referenced work demonstrates that objective selection (LSD, ESD, PSD) should be matched to the ambient dimension and the specific properties of the target distribution. LSD, as a member of this taxonomy, is most beneficial where Lagrangian PDE enforcement enables precise geometric shaping of the learned flow.