Lagrangian Self-Distillation (LSD)

Updated 15 February 2026

The paper introduces LSD, which leverages analytic ODE signals to self-distill flow maps, eliminating the need for pre-trained teacher networks.
It parameterizes flow maps using a first-order Taylor expansion and enforces both diagonal and off-diagonal Lagrangian constraints to guide learning.
The approach guarantees convergence in 2-Wasserstein distance and highlights trade-offs between derivative-based and derivative-free methods in different problem dimensions.

Lagrangian Self-Distillation (LSD) is a direct training paradigm for learning flow maps in consistency models, which are generative models defined by solutions to probability flow ordinary differential equations (ODEs). LSD leverages the structure of these flows to construct self-distillation objectives that do not require a pre-trained teacher network. Instead, the framework bootstraps the training signal from analytic properties of the flow, enforcing both the diagonal and off-diagonal constraints implied by the underlying ODE governing sample transformation from a base to a target distribution (Boffi et al., 24 May 2025).

1. Probability Flow ODEs and Flow Maps

LSD operates in the context of continuous-time probability flows that transport an initial base distribution $\rho_0$ (often Gaussian) to a target distribution $\rho_1$. The model is defined by a probability flow ODE in $\mathbb{R}^d$: $\dot{x}_t = b_t(x_t), \quad x_0 \sim \rho_0,$ where $b_t$ is the velocity field at time $t$. This can be constructed using a stochastic interpolant,

$I_t(x_0, x_1) = \alpha_t x_0 + \beta_t x_1,$

with coefficients satisfying $\alpha_0=1$, $\alpha_1=0$, $\beta_0=0$, $\beta_1=1$ and $(x_0, x_1) \sim (\rho_0, \rho_1)$. The time-derivative is

$\dot{I}_t(x_0, x_1) = \dot{\alpha}_t x_0 + \dot{\beta}_t x_1,$

leading to a velocity field given by conditional expectation,

$b_t(x) = E[\dot{I}_t | I_t = x].$
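
The conditional expectation above can be checked numerically in a case where it has a closed form. The following is a minimal sketch, assuming a one-dimensional toy problem that is not taken from the source: a linear interpolant ($\alpha_t = 1-t$, $\beta_t = t$), a base $\rho_0 = N(0,1)$, and an independent target $\rho_1 = N(m, s^2)$. The function names and the values of `m` and `s` are hypothetical. Because $(I_t, \dot{I}_t)$ is jointly Gaussian in this setting, $b_t$ is affine in $x$, and its slope should match a least-squares regression of $\dot{I}_t$ on $I_t$.

```python
import random

def interpolant(t, x0, x1):
    """Linear interpolant I_t and its time derivative (alpha_t = 1 - t, beta_t = t)."""
    return (1.0 - t) * x0 + t * x1, x1 - x0

def gaussian_velocity(t, x, m=2.0, s=0.5):
    """Closed-form b_t(x) = E[dI_t/dt | I_t = x] for x0 ~ N(0, 1), x1 ~ N(m, s^2), independent."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2   # Var(I_t)
    cov = t * s * s - (1.0 - t)             # Cov(dI_t/dt, I_t)
    return m + (cov / var_t) * (x - t * m)

def regression_slope(t, m=2.0, s=0.5, n=100_000, seed=0):
    """Monte Carlo least-squares slope of dI_t/dt regressed on I_t."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        i_t, di_t = interpolant(t, rng.gauss(0.0, 1.0), rng.gauss(m, s))
        xs.append(i_t)
        ys.append(di_t)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx
```

In this toy case the slope of `gaussian_velocity` in `x` and the value returned by `regression_slope` should agree up to Monte Carlo error. For non-Gaussian targets no such closed form is generally available, which is why $b_t$ is learned by regression.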

The two-time flow map $X_{s,t}: \mathbb{R}^d \to \mathbb{R}^d$ is defined by ODE evolution from time $s$ to $t$ with $X_{s,s}(x) = x$: $X_{s,t}(x) = x_t,$ where $x_t$ satisfies the ODE with $x_s = x$. Single-step mapping from $\rho_0$ to $\rho_1$ is accomplished by computing $X_{0,1}(x_0)$.
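
To make the definition of $X_{s,t}$ concrete, the sketch below integrates the probability flow ODE numerically and compares the result with an exact flow map. It assumes the same hypothetical one-dimensional Gaussian toy problem (linear interpolant, base $N(0,1)$, target $N(M, S^2)$), for which the flow map is affine: $X_{s,t}(x) = tM + (\sigma_t/\sigma_s)(x - sM)$ with $\sigma_t^2 = (1-t)^2 + t^2 S^2$. The RK4 integrator and all names are illustrative choices, not part of the source.

```python
import math

M, S = 2.0, 0.5  # hypothetical target N(M, S^2); the base is N(0, 1)

def sigma(t):
    """Standard deviation of I_t under the linear interpolant."""
    return math.sqrt((1.0 - t) ** 2 + (t * S) ** 2)

def b(t, x):
    """Closed-form velocity field for the Gaussian toy problem."""
    cov = t * S * S - (1.0 - t)
    return M + (cov / sigma(t) ** 2) * (x - t * M)

def flow_map_ode(s, t, x, n_steps=200):
    """Approximate X_{s,t}(x) by integrating dx/dt = b_t(x) from s to t with RK4."""
    h = (t - s) / n_steps
    u = s
    for _ in range(n_steps):
        k1 = b(u, x)
        k2 = b(u + 0.5 * h, x + 0.5 * h * k1)
        k3 = b(u + 0.5 * h, x + 0.5 * h * k2)
        k4 = b(u + h, x + h * k3)
        x += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        u += h
    return x

def flow_map_exact(s, t, x):
    """Exact affine flow map for the Gaussian toy problem."""
    return t * M + (sigma(t) / sigma(s)) * (x - s * M)
```

The integrator needs many evaluations of $b_t$ per sample, whereas a learned flow map replaces them with a single evaluation of $X_{0,1}$.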

2. Lagrangian Tangency and Flow Map Parameterization

The flow map $X_{s,t}$ satisfies the so-called Lagrangian (tangent) relation, a differential equation in $t$ for each fixed $s$ and $x$: $\partial_t X_{s,t}(x) = b_t(X_{s,t}(x)), \quad X_{s,s}(x) = x.$ As $s \rightarrow t$, the rate of change of the flow map recovers the instantaneous velocity field: $\lim_{s \to t} \partial_t X_{s,t}(x) = b_t(x).$ A practical parameterization of the flow map is given by a first-order Taylor expansion: $X_{s,t}(x) = x + (t-s)v_{s,t}(x),$ which implies the diagonal constraint $v_{t,t}(x) = b_t(x)$.
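
The step from the parameterization to the diagonal constraint can be written out explicitly. The short derivation below assumes that $v_{s,t}(x)$ is differentiable in $t$ and that $\partial_t v_{s,t}(x)$ remains bounded as $s \to t$:

```latex
\begin{aligned}
\partial_t X_{s,t}(x) &= \partial_t \big[ x + (t-s)\, v_{s,t}(x) \big]
  = v_{s,t}(x) + (t-s)\, \partial_t v_{s,t}(x), \\
\lim_{s \to t} \partial_t X_{s,t}(x) &= v_{t,t}(x), \\
b_t(x) &= \lim_{s \to t} \partial_t X_{s,t}(x) = v_{t,t}(x).
\end{aligned}
```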

3. Self-Distillation Loss Construction

LSD trains a single neural network $v_{s,t}(x)$ so that it satisfies two properties:

On the diagonal $s=t$, $v_{t,t}(x)$ matches $b_t(x)$ (the velocity field).
Off the diagonal $s < t$, the flow map parameterization satisfies the Lagrangian PDE.

The training objective has two components:

Diagonal (flow matching) loss: $L_b(v) = \int_{0}^{1} E_{(x_0,x_1) \sim (\rho_0,\rho_1)}\, |v_{t,t}(I_t) - \dot{I}_t|^2\,dt,$ which is uniquely minimized when $v_{t,t}(x) = b_t(x)$.
Off-diagonal Lagrangian self-distillation loss: $L_D^{\text{LSD}}(v) = \int_{0}^{1}\!\!\int_{0}^{t} E_{(x_0,x_1) \sim (\rho_0,\rho_1)}\, \big|\partial_t \hat{X}_{s,t}(I_s) - v_{t,t}(\hat{X}_{s,t}(I_s))\big|^2\,ds\,dt,$ with $\hat{X}_{s,t}(x) = x + (t-s)v_{s,t}(x)$ and time derivatives computed via automatic differentiation. When the diagonal condition $v_{t,t} = b_t$ also holds, the global minimum satisfies $\partial_t \hat{X}_{s,t}(x) = b_t(\hat{X}_{s,t}(x))$.
Combined self-distillation loss: $L_{\rm SD}(v) = L_b(v) + L_D^{\text{LSD}}(v).$ Minimization of $L_{\rm SD}$ enforces the tangent (Lagrangian) flow condition everywhere. No external teacher model is used; the analytic diagonal signal $\dot{I}_t$ serves as the regression target.
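
The two loss terms can be estimated by Monte Carlo. The sketch below is illustrative only and rests on several assumptions not found in the source: a hypothetical one-dimensional Gaussian toy problem (linear interpolant, base $N(0,1)$, target $N(M, S^2)$) in which the exact $v_{s,t}$ is known in closed form, a central finite difference in $t$ standing in for automatic differentiation, and no stop-gradient or network training. It only evaluates the objectives for a given candidate `v`.

```python
import math
import random

M, S = 2.0, 0.5  # hypothetical 1D target N(M, S^2); the base is N(0, 1)

def sigma(t):
    return math.sqrt((1.0 - t) ** 2 + (t * S) ** 2)

def b(t, x):
    """Closed-form velocity field for the Gaussian toy problem."""
    return M + ((t * S * S - (1.0 - t)) / sigma(t) ** 2) * (x - t * M)

def v_exact(s, t, x):
    """Exact v_{s,t}(x) = (X_{s,t}(x) - x) / (t - s); equals b_t(x) on the diagonal."""
    if abs(t - s) < 1e-12:
        return b(t, x)
    x_st = t * M + (sigma(t) / sigma(s)) * (x - s * M)
    return (x_st - x) / (t - s)

def x_hat(v, s, t, x):
    """Flow map parameterization x + (t - s) * v_{s,t}(x)."""
    return x + (t - s) * v(s, t, x)

def diagonal_loss(v, n=20_000, seed=0):
    """Monte Carlo estimate of L_b; minimized (not zeroed) at v_{t,t} = b_t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t = rng.random()
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(M, S)
        i_t, di_t = (1.0 - t) * x0 + t * x1, x1 - x0
        total += (v(t, t, i_t) - di_t) ** 2
    return total / n

def lsd_loss(v, n=20_000, seed=0, eps=1e-4):
    """Average squared Lagrangian residual over 0 <= s <= t <= 1
    (the double integral up to the constant simplex area 1/2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        s, t = sorted((rng.random(), rng.random()))
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(M, S)
        i_s = (1.0 - s) * x0 + s * x1
        # Central difference in t: a stand-in for autodiff in this sketch.
        dt_x = (x_hat(v, s, t + eps, i_s) - x_hat(v, s, t - eps, i_s)) / (2 * eps)
        total += (dt_x - v(t, t, x_hat(v, s, t, i_s))) ** 2
    return total / n
```

Evaluating `lsd_loss(v_exact)` should give a value near zero, while a perturbed candidate such as `lambda s, t, x: v_exact(s, t, x) + 0.5` should raise both terms.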

4. Practical Training Details and Algorithmic Strategies

Several important practical strategies characterize efficient LSD training:

Teacher warmup: Initially train only the diagonal loss $L_b$ ($s=t$) for several thousand steps to guide $v_{t,t}$ toward $b_t$ before introducing off-diagonal terms.
Gradual incorporation of off-diagonal terms: The maximum allowed $|t-s|$ is linearly annealed, smoothly ramping in the Lagrangian self-distillation loss over the off-diagonals.
Sampling over time indices: The time pairs $(s,t)$ are sampled uniformly or with a learned weight over the simplex $0 \leq s \leq t \leq 1$.
Efficient differentiation: Autodifferentiation primitives (e.g., jvp/vjp in autodiff frameworks) are employed to compute $\partial_t X$ or $\partial_s X$ and spatial gradients $\nabla_x X$ efficiently.
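
The warmup and annealing strategies above can be sketched as a time-pair sampler. This is a minimal illustration under stated assumptions: the step counts `warmup_steps` and `anneal_steps` are hypothetical values (the text only says "several thousand steps"), and the sampler draws $t$ uniformly and then $s$ within the allowed gap, which is one simple choice rather than exact uniform sampling over the simplex.

```python
import random

def max_gap(step, warmup_steps=5_000, anneal_steps=20_000):
    """Largest allowed |t - s| at a training step: 0 during warmup, then a linear ramp to 1."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / anneal_steps)

def sample_time_pair(step, rng, warmup_steps=5_000, anneal_steps=20_000):
    """Draw (s, t) with 0 <= s <= t <= 1 and t - s <= max_gap(step)."""
    gap = max_gap(step, warmup_steps, anneal_steps)
    t = rng.random()
    s = max(0.0, t - gap * rng.random())
    return s, t

rng = random.Random(0)
print(sample_time_pair(100, rng))     # warmup: s == t, so only the diagonal loss is active
print(sample_time_pair(15_000, rng))  # halfway through the ramp: t - s is at most 0.5
```

During warmup every pair lies on the diagonal, so the off-diagonal term contributes nothing; once the ramp completes, pairs cover the full simplex.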

5. Comparative Variants: ESD and PSD

The LSD loss as formulated requires computation of time-derivatives of the network with respect to $t$. The framework encompasses alternative objectives:

Eulerian Self-Distillation (ESD): Enforces the flow PDE in Eulerian coordinates, introducing a spatial derivative term: $\partial_s X_{s,t}(x) + \nabla_x X_{s,t}(x) v_{s,s}(x) = 0,$ which necessitates spatial gradients $\nabla_x X$ .
Progressive Self-Distillation (PSD): Enforces the semigroup property $X_{s,t} = X_{u,t} \circ X_{s,u}$ using a three-point loss that avoids both time and spatial derivatives.
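
The difference between the two variants shows up in what each residual needs to compute. The sketch below evaluates both on a hypothetical one-dimensional Gaussian toy problem (linear interpolant, base $N(0,1)$, target $N(M, S^2)$) whose exact flow map is known. Finite differences stand in for automatic differentiation, and the exact $b_s$ is used in place of the learned diagonal $v_{s,s}$; both are assumptions of this illustration.

```python
import math

M, S = 2.0, 0.5  # hypothetical target N(M, S^2); the base is N(0, 1)

def sigma(t):
    return math.sqrt((1.0 - t) ** 2 + (t * S) ** 2)

def b(t, x):
    """Closed-form velocity field for the Gaussian toy problem."""
    return M + ((t * S * S - (1.0 - t)) / sigma(t) ** 2) * (x - t * M)

def flow_map(s, t, x):
    """Exact affine flow map X_{s,t}(x) for the Gaussian toy problem."""
    return t * M + (sigma(t) / sigma(s)) * (x - s * M)

def psd_residual(X, s, u, t, x):
    """Semigroup mismatch X_{s,t}(x) - X_{u,t}(X_{s,u}(x)); uses no derivatives."""
    return X(s, t, x) - X(u, t, X(s, u, x))

def esd_residual(X, s, t, x, eps=1e-5):
    """Eulerian residual d/ds X_{s,t}(x) + d/dx X_{s,t}(x) * v_{s,s}(x), with v_{s,s} = b_s here."""
    ds = (X(s + eps, t, x) - X(s - eps, t, x)) / (2 * eps)
    dx = (X(s, t, x + eps) - X(s, t, x - eps)) / (2 * eps)
    return ds + dx * b(s, x)

print(psd_residual(flow_map, 0.2, 0.5, 0.9, 0.3))  # near zero for the exact map
print(esd_residual(flow_map, 0.2, 0.9, 0.3))       # near zero up to finite-difference error
```

The PSD residual needs only three evaluations of the map, while the ESD residual needs a derivative in $s$ and a spatial derivative, which in higher dimensions becomes a Jacobian-vector product.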

Empirical results indicate that, on high-dimensional tasks such as image synthesis, both LSD and ESD suffer from high-variance gradients and training instability due to derivative computations, whereas PSD's derivative-free objective yields more stable training and improved FID scores for one- and two-step sampling. On low-dimensional problems, LSD's derivative-based objective captures sharp, non-linear features more accurately, as the off-diagonal PDE residual enables learning of steep boundaries in multimodal densities.

6. Theoretical Guarantees and Limitations

The proposed LSD methodology comes with guarantees on generative distribution matching. As $L_{\rm SD} \to 0$, the 2-Wasserstein distance between the learned law $\mathcal{L}(X_{0,1}(\rho_0))$ and $\rho_1$ converges to zero at a rate $\mathcal{O}(\sqrt{L_{\rm SD}})$. That is, minimizing the combined LSD loss provably yields accurate learned flow maps in the sense of optimal transport, bridging the base and target laws via the consistency model approach.
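
To illustrate the quantity that this guarantee controls, the sketch below computes the 2-Wasserstein distance in a one-dimensional Gaussian setting, where it has the closed form $W_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$, and compares it with a sample-based estimate using sorted (quantile) matching. The affine "learned" map and its coefficients are hypothetical stand-ins for a trained $X_{0,1}$, and the sketch does not reproduce the bound itself.

```python
import math
import random

def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form W2 between N(m1, s1^2) and N(m2, s2^2) in 1D (s1, s2 >= 0)."""
    return math.hypot(m1 - m2, s1 - s2)

def w2_empirical_1d(xs, ys):
    """Empirical W2 between equal-size 1D samples via sorted (quantile) matching."""
    assert len(xs) == len(ys)
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(xs))

def learned_map(x0, a=2.1, c=0.6):
    """Hypothetical learned one-step map X_{0,1}(x) = a + c * x.
    For a target N(2, 0.5^2) the exact map would have a = 2, c = 0.5."""
    return a + c * x0

rng = random.Random(0)
n = 20_000
generated = [learned_map(rng.gauss(0.0, 1.0)) for _ in range(n)]  # pushforward of rho_0
target = [rng.gauss(2.0, 0.5) for _ in range(n)]                  # samples from rho_1

print(w2_gaussian_1d(2.1, 0.6, 2.0, 0.5))  # closed form, about 0.141
print(w2_empirical_1d(generated, target))  # sample estimate, close to the closed form
```

As the hypothetical coefficients approach the exact ones, both numbers shrink toward zero, which is the sense in which the learned law approaches $\rho_1$.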

LSD addresses a central limitation of prior distillation schemes for consistency models, namely the dependency on pre-trained teacher networks and multi-stage distillation. By exploiting the analytic diagonal signal and the Lagrangian structure of the flow map, training is converted to a self-distillation procedure.

A frequent misconception is that off-diagonal self-distillation necessarily improves performance in high dimensions. Empirical observations indicate that, in practice, derivative-based objectives such as LSD and ESD can introduce instability and high-variance gradients, especially for image synthesis, whereas PSD provides superior stability and generative metrics in these settings (Boffi et al., 24 May 2025). Conversely, derivative-based LSD is advantageous in low-dimensional problems with sharp boundaries.

The systematic framework outlined in Boffi et al. (2024) and extended in the referenced work demonstrates that objective selection (LSD, ESD, PSD) should be matched to the ambient dimension and the specific properties of the target distribution. LSD, as a member of this taxonomy, is most beneficial where Lagrangian PDE enforcement enables precise geometric shaping of the learned flow.

References

Boffi et al. (2025). How to build a consistency model: Learning flow maps via self-distillation.
