Lagrangian Self-Distillation (LSD)

Updated 15 February 2026
  • The paper introduces LSD, which leverages analytic ODE signals to self-distill flow maps, eliminating the need for pre-trained teacher networks.
  • It parameterizes flow maps using a first-order Taylor expansion and enforces both diagonal and off-diagonal Lagrangian constraints to guide learning.
  • The approach guarantees convergence in 2-Wasserstein distance and highlights trade-offs between derivative-based and derivative-free methods in different problem dimensions.

Lagrangian Self-Distillation (LSD) is a direct training paradigm for learning flow maps in consistency models, which are generative models defined by solutions to probability flow ordinary differential equations (ODEs). LSD leverages the structure of these flows to construct self-distillation objectives that do not require a pre-trained teacher network. Instead, the framework bootstraps the training signal from analytic properties of the flow, enforcing both the diagonal and off-diagonal constraints implied by the underlying ODE governing sample transformation from a base to a target distribution (Boffi et al., 24 May 2025).

1. Probability Flow ODEs and Flow Maps

LSD operates in the context of continuous-time probability flows that transport an initial base distribution $\rho_0$ (often Gaussian) to a target distribution $\rho_1$. The model is defined by a probability flow ODE in $\mathbb{R}^d$: $\dot{x}_t = b_t(x_t), \quad x_0 \sim \rho_0,$ where $b_t$ is the velocity field at time $t$. This velocity field can be constructed using a stochastic interpolant,

$$I_t(x_0, x_1) = \alpha_t x_0 + \beta_t x_1,$$

with coefficients satisfying $\alpha_0=1$, $\alpha_1=0$, $\beta_0=0$, $\beta_1=1$ and $(x_0, x_1) \sim (\rho_0, \rho_1)$. The time-derivative is

$$\dot{I}_t(x_0, x_1) = \dot{\alpha}_t x_0 + \dot{\beta}_t x_1,$$

leading to a velocity field given by conditional expectation,

$$b_t(x) = E[\dot{I}_t \mid I_t = x].$$

The two-time flow map $X_{s,t}: \mathbb{R}^d \to \mathbb{R}^d$ is defined by ODE evolution from time $s$ to $t$ with $X_{s,s}(x) = x$: $X_{s,t}(x) = x_t,$ where $x_t$ satisfies the ODE with $x_s = x$. Single-step mapping from $\rho_0$ to $\rho_1$ is accomplished by computing $X_{0,1}(x_0)$.
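The definitions above can be made concrete in a case where everything is available in closed form. The following is a minimal sketch, assuming a one-dimensional Gaussian base and target, an independent coupling, and the linear interpolant $\alpha_t = 1 - t$, $\beta_t = t$; the constants `M` and `SIGMA` and all function names are illustrative choices, not taken from the paper. It compares a single evaluation of $X_{0,1}$ with a many-step Euler integration of the ODE.

```python
import math

# Hypothetical 1D setup: base N(0, 1), target N(M, SIGMA^2), independent coupling,
# linear interpolant alpha_t = 1 - t, beta_t = t (all illustrative assumptions).
M, SIGMA = 2.0, 0.5

def std(t):
    """Standard deviation of I_t = (1 - t) * x0 + t * x1."""
    return math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)

def velocity(t, x):
    """b_t(x) = E[dI_t/dt | I_t = x], closed form for jointly Gaussian variables."""
    cov = t * SIGMA ** 2 - (1.0 - t)  # Cov(dI_t/dt, I_t)
    return M + cov / std(t) ** 2 * (x - M * t)

def flow_map_exact(s, t, x):
    """Closed-form two-time flow map X_{s,t}(x) for this Gaussian example."""
    return M * t + std(t) / std(s) * (x - M * s)

def flow_map_euler(s, t, x, n_steps=10_000):
    """Approximate X_{s,t}(x) by forward-Euler integration of dx/dt = b_t(x)."""
    h = (t - s) / n_steps
    for k in range(n_steps):
        x += h * velocity(s + k * h, x)
    return x

x0 = 0.7
one_step = flow_map_exact(0.0, 1.0, x0)    # single evaluation of X_{0,1}
integrated = flow_map_euler(0.0, 1.0, x0)  # many small ODE steps
print(one_step, integrated)
```

The two printed values should agree up to the Euler discretization error, which illustrates what a learned flow map is meant to replace: the numerical integration loop.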

2. Lagrangian Tangency and Flow Map Parameterization

The flow map $X_{s,t}$ satisfies the so-called Lagrangian (tangent) relation, a PDE in $t$ for each $s$ and $x$: $\partial_t X_{s,t}(x) = b_t(X_{s,t}(x)), \quad X_{s,s}(x) = x.$ As $s \rightarrow t$, the rate of change of the flow map recovers the instantaneous velocity field: $\lim_{s \to t} \partial_t X_{s,t}(x) = b_t(x).$ A practical parameterization of the flow map is given by a first-order Taylor expansion: $X_{s,t}(x) = x + (t-s)v_{s,t}(x),$ which implies the diagonal constraint $v_{t,t}(x) = b_t(x)$.
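The step from the parameterization to the diagonal constraint can be spelled out in a short derivation. It uses only the definitions above and assumes that $\partial_t v_{s,t}(x)$ stays bounded near the diagonal $s = t$.

```latex
\begin{align*}
\partial_t X_{s,t}(x)
  &= \partial_t\big[x + (t-s)\,v_{s,t}(x)\big]
   = v_{s,t}(x) + (t-s)\,\partial_t v_{s,t}(x), \\
\partial_t X_{s,t}(x)\big|_{s=t}
  &= v_{t,t}(x), \\
b_t\big(X_{s,t}(x)\big)\big|_{s=t}
  &= b_t(x) \quad \text{since } X_{t,t}(x) = x, \\
\Longrightarrow \quad v_{t,t}(x) &= b_t(x).
\end{align*}
```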

3. Self-Distillation Loss Construction

LSD trains a single neural network $v_{s,t}(x)$ so that it satisfies two properties:

  • On the diagonal $s=t$, $v_{t,t}(x)$ matches $b_t(x)$ (the velocity field).
  • Off the diagonal $s < t$, the flow map parameterization satisfies the Lagrangian PDE.

The training objective has two components:

  • Diagonal (flow matching) loss: $L_b(v) = \int_{0}^{1} E_{x_0,x_1 \sim \rho}\, |v_{t,t}(I_t) - \dot{I}_t|^2\,dt,$ which is uniquely minimized when $v_{t,t}(x) = b_t(x)$.
  • Off-diagonal Lagrangian self-distillation loss: $L_D^{\text{LSD}}(v) = \int_{0}^{1}\!\!\int_{0}^{t} E_{x_0,x_1}\, \big|\partial_t \hat{X}_{s,t}(I_s) - v_{t,t}(\hat{X}_{s,t}(I_s))\big|^2\,ds\,dt,$ with $\hat{X}_{s,t}(x) = x + (t-s)v_{s,t}(x)$ and time derivatives computed via automatic differentiation. At the global minimum, $\partial_t X_{s,t}(x) = b_t(X_{s,t}(x))$ holds.
  • Combined self-distillation loss: $L_{\rm SD}(v) = L_b(v) + L_D^{\text{LSD}}(v).$ Minimization of $L_{\rm SD}$ enforces the tangent (Lagrangian) flow condition everywhere. No external teacher model is used; the analytic diagonal signal $\dot{I}_t$ serves as the regression target.
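The two loss terms listed above can be estimated by Monte Carlo in a toy setting. The sketch below is not the paper's implementation: it assumes a one-dimensional Gaussian example with a closed-form flow map, uses hand-written functions in place of a neural network, and swaps the automatic-differentiation time derivative for a central finite difference. The constants `M`, `SIGMA` and every function name are hypothetical.

```python
import math
import random

M, SIGMA = 2.0, 0.5  # hypothetical target N(M, SIGMA^2); base is N(0, 1)

def std(t):
    return math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)

def b(t, x):
    """Exact velocity field of the Gaussian example."""
    return M + (t * SIGMA ** 2 - (1.0 - t)) / std(t) ** 2 * (x - M * t)

def v_exact(s, t, x):
    """v_{s,t}(x) = (X_{s,t}(x) - x) / (t - s), with v_{t,t} = b_t on the diagonal."""
    if abs(t - s) < 1e-12:
        return b(t, x)
    big_x = M * t + std(t) / std(s) * (x - M * s)
    return (big_x - x) / (t - s)

def v_wrong(s, t, x):
    """Correct on the diagonal, wrong off it (illustrative perturbation)."""
    return v_exact(s, t, x) + 0.5 * (t - s)

def v_shifted(s, t, x):
    """Wrong everywhere, including the diagonal (illustrative perturbation)."""
    return v_exact(s, t, x) + 0.3

def x_hat(v, s, t, x):
    """Flow map parameterization x + (t - s) * v_{s,t}(x)."""
    return x + (t - s) * v(s, t, x)

def interpolant(t, x0, x1):
    """Return (I_t, dI_t/dt) for alpha_t = 1 - t, beta_t = t."""
    return (1.0 - t) * x0 + t * x1, x1 - x0

def diagonal_loss(v, n, rng):
    total = 0.0
    for _ in range(n):
        t = rng.random()
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(M, SIGMA)
        i_t, di_t = interpolant(t, x0, x1)
        total += (v(t, t, i_t) - di_t) ** 2
    return total / n

def lsd_loss(v, n, rng, h=1e-4):
    total = 0.0
    for _ in range(n):
        s, t = sorted((rng.random(), rng.random()))
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(M, SIGMA)
        i_s, _ = interpolant(s, x0, x1)
        # Central difference in t: a stand-in for an autodiff time derivative.
        dt_x = (x_hat(v, s, t + h, i_s) - x_hat(v, s, t - h, i_s)) / (2.0 * h)
        total += (dt_x - v(t, t, x_hat(v, s, t, i_s))) ** 2
    return 0.5 * total / n  # factor 1/2 is the area of the simplex 0 <= s <= t <= 1

lb_exact = diagonal_loss(v_exact, 20_000, random.Random(0))
lb_shifted = diagonal_loss(v_shifted, 20_000, random.Random(0))
lsd_exact = lsd_loss(v_exact, 5_000, random.Random(1))
lsd_wrong = lsd_loss(v_wrong, 5_000, random.Random(1))
print(lb_exact, lb_shifted, lsd_exact, lsd_wrong)
```

In this toy case the diagonal loss of the exact field is not zero, because $\dot{I}_t$ is a noisy target whose conditional mean is $b_t$; it is only smaller than the loss of the shifted field. The off-diagonal residual, by contrast, vanishes (up to finite-difference error) for the exact map and is clearly positive for the perturbed one.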

4. Practical Training Details and Algorithmic Strategies

Several important practical strategies characterize efficient LSD training:

  • Teacher warmup: Initially train only the diagonal loss $L_b$ ($s=t$) for several thousand steps to guide $v_{t,t}$ toward $b_t$ before introducing off-diagonal terms.
  • Gradual incorporation of off-diagonal terms: The maximum allowed $|t-s|$ is linearly annealed, smoothly ramping in the Lagrangian self-distillation loss over the off-diagonals.
  • Sampling over time indices: The time pairs $(s,t)$ are sampled uniformly or with a learned weight over the simplex $0 \leq s \leq t \leq 1$.
  • Efficient differentiation: Autodifferentiation primitives (e.g., jvp/vjp in autodiff frameworks) are employed to compute $\partial_t X$ or $\partial_s X$ and spatial gradients $\nabla_x X$ efficiently.
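The warmup, annealing, and time-pair sampling strategies listed above could be organized along the following lines. This is a sketch under stated assumptions: the step counts, function names, and the particular way of sampling $s$ given $t$ are placeholders chosen for illustration, not values or procedures reported in the paper.

```python
import random

def max_gap(step, warmup_steps=5_000, anneal_steps=20_000):
    """Largest allowed |t - s| at a given training step.

    Zero during the diagonal-only warmup, then a linear ramp up to 1.
    Both step counts are hypothetical placeholders.
    """
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / anneal_steps)

def sample_time_pair(rng, gap):
    """Draw (s, t) with 0 <= s <= t <= 1 and t - s <= gap.

    t is uniform on [0, 1]; s is uniform on [max(0, t - gap), t].
    This is one simple choice, not necessarily the paper's sampler.
    """
    t = rng.random()
    s = t - rng.random() * min(gap, t)
    return s, t

def loss_weights(step, warmup_steps=5_000):
    """(diagonal weight, off-diagonal weight); the latter is zero during warmup."""
    return 1.0, (0.0 if step < warmup_steps else 1.0)

rng = random.Random(0)
for step in (0, 5_000, 15_000, 40_000):
    gap = max_gap(step)
    s, t = sample_time_pair(rng, gap)
    print(step, gap, loss_weights(step), round(s, 3), round(t, 3))
```

With `gap == 0.0` the sampler returns $s = t$, so the same code path covers the diagonal-only phase and the annealed off-diagonal phase.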

5. Comparative Variants: ESD and PSD

The LSD loss as formulated requires computation of time-derivatives of the network with respect to $t$. The framework encompasses alternative objectives:

  • Eulerian Self-Distillation (ESD): Enforces the flow PDE in Eulerian coordinates, introducing a spatial derivative term: $\partial_s X_{s,t}(x) + \nabla_x X_{s,t}(x)\, v_{s,s}(x) = 0,$ which necessitates spatial gradients $\nabla_x X$.
  • Progressive Self-Distillation (PSD): Enforces the semigroup property $X_{s,t} = X_{u,t} \circ X_{s,u}$ using a three-point loss that avoids both time and spatial derivatives.
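Both conditions in the list above can be checked numerically on a flow map that is known exactly. The sketch below assumes the same kind of one-dimensional Gaussian toy problem (constants and names are hypothetical) and evaluates the pointwise residuals that an ESD-style or PSD-style loss would square and average. Finite differences stand in for the derivatives that a real implementation would obtain by automatic differentiation.

```python
import math

M, SIGMA = 2.0, 0.5  # hypothetical target N(M, SIGMA^2); base is N(0, 1)

def std(t):
    return math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)

def b(t, x):
    """Exact velocity field; equals v_{t,t}(x) for the exact flow map."""
    return M + (t * SIGMA ** 2 - (1.0 - t)) / std(t) ** 2 * (x - M * t)

def flow_map(s, t, x):
    """Closed-form X_{s,t}(x) for the Gaussian example."""
    return M * t + std(t) / std(s) * (x - M * s)

def psd_residual(s, u, t, x):
    """X_{s,t}(x) - X_{u,t}(X_{s,u}(x)); zero when the semigroup property holds.

    Needs three map evaluations and no derivatives.
    """
    return flow_map(s, t, x) - flow_map(u, t, flow_map(s, u, x))

def esd_residual(s, t, x, h=1e-5):
    """d/ds X_{s,t}(x) + d/dx X_{s,t}(x) * v_{s,s}(x), via central differences."""
    d_s = (flow_map(s + h, t, x) - flow_map(s - h, t, x)) / (2.0 * h)
    d_x = (flow_map(s, t, x + h) - flow_map(s, t, x - h)) / (2.0 * h)
    return d_s + d_x * b(s, x)

print(psd_residual(0.2, 0.5, 0.9, 1.3))
print(esd_residual(0.2, 0.9, 1.3))
```

Both residuals should be zero up to floating-point and finite-difference error, since the exact flow map satisfies the Eulerian equation and the semigroup property simultaneously.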

Empirical results indicate that, on high-dimensional tasks such as image synthesis, both LSD and ESD suffer from high-variance gradients and training instability due to derivative computations, whereas PSD's derivative-free objective yields more stable training and improved FID scores for single- or two-step sampling with $X_{0,1}$. On low-dimensional problems, LSD's derivative-based objectives capture sharp, non-linear features more accurately, as the off-diagonal PDE residual enables learning of steep boundaries in multimodal densities.

6. Theoretical Guarantees and Limitations

The proposed LSD methodology comes with guarantees on generative distribution matching. As $L_{\rm SD} \to 0$, the 2-Wasserstein distance between the learned law $\mathcal{L}(X_{0,1}(\rho_0))$ and $\rho_1$ converges to zero at a rate $\mathcal{O}(\sqrt{L_{\rm SD}})$. That is, minimizing the combined LSD loss provably yields accurate learned flow maps in the sense of optimal transport, bridging the base and target laws via the consistency model approach.
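To make the error metric in this guarantee tangible, the following sketch estimates the 2-Wasserstein distance between one-step samples and target samples in one dimension, where the optimal coupling between equal-size samples is obtained by sorting. It assumes a Gaussian toy problem with a closed-form $X_{0,1}$; the `shift` parameter is a hypothetical stand-in for model error. The sketch illustrates the metric only and does not verify the stated rate.

```python
import math
import random

M, SIGMA = 2.0, 0.5  # hypothetical target N(M, SIGMA^2); base is N(0, 1)

def one_step_map(x0, shift=0.0):
    """X_{0,1}(x0) for the Gaussian example; `shift` injects an artificial error."""
    return M + SIGMA * x0 + shift

def empirical_w2(a, b):
    """2-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling pairs sorted samples.
    """
    if len(a) != len(b):
        raise ValueError("samples must have equal size")
    sq = sum((p - q) ** 2 for p, q in zip(sorted(a), sorted(b)))
    return math.sqrt(sq / len(a))

rng = random.Random(0)
n = 5_000
base = [rng.gauss(0.0, 1.0) for _ in range(n)]
target = [rng.gauss(M, SIGMA) for _ in range(n)]

w2_exact = empirical_w2([one_step_map(x) for x in base], target)
w2_biased = empirical_w2([one_step_map(x, shift=0.3) for x in base], target)
print(w2_exact, w2_biased)
```

The exact map leaves only sampling noise in the estimate, while the biased map produces a distance close to the size of the injected shift.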

LSD addresses a central limitation of prior distillation schemes for consistency models, namely the dependency on pre-trained teacher networks and multi-stage distillation. By exploiting the analytic diagonal signal and the Lagrangian structure of the flow map, training is converted to a self-distillation procedure.

A frequent misconception is that off-diagonal self-distillation necessarily improves performance in high dimensions. Empirical observations indicate that, in practice, derivative-based objectives such as LSD and ESD can introduce instability and high-variance gradients, especially for image synthesis, whereas PSD provides superior stability and generative metrics in these settings (Boffi et al., 24 May 2025). Conversely, derivative-based LSD is advantageous in low-dimensional problems with sharp boundaries.

The systematic framework outlined in Boffi et al. (2024) and extended in the referenced work demonstrates that objective selection (LSD, ESD, PSD) should be matched to the ambient dimension and the specific properties of the target distribution. LSD, as a member of this taxonomy, is most beneficial where Lagrangian PDE enforcement enables precise geometric shaping of the learned flow.
