
Latent Stochastic Interpolants

Updated 29 November 2025
  • LSI is a framework that extends stochastic interpolants into latent space, enabling efficient generative modeling with continuous-time transport between arbitrary priors and data-driven posteriors.
  • It jointly optimizes the encoder, decoder, and latent dynamics using a unified SDE/ODE formulation, reducing computational cost and enhancing training efficiency.
  • LSI achieves competitive performance in generative tasks by lowering sampling FLOPs and supporting flexible latent representations, as validated by empirical and numerical analyses.

Latent Stochastic Interpolants (LSI) are a theoretical and algorithmic extension of the stochastic interpolant (SI) framework, transferring SI’s continuous-time transport between probability measures from data or observation space into a learned, low-dimensional latent space. This facilitates highly flexible generative modeling, supports arbitrary latent priors, and enables efficient sampling and training schemes using a unified SDE/ODE formalism. LSI jointly optimizes the encoder, decoder, and latent-path dynamics, allowing end-to-end learning of both latent representations and the generative process bridging an arbitrary prior to the data-driven aggregated posterior. This section details the mathematical formulation, algorithmic structure, sample complexity, numerical schemes, and salient properties of the LSI framework, referencing definitions and empirical results as established across the literature, principally in (Singh et al., 2 Jun 2025, Albergo et al., 2023, Liu et al., 10 Aug 2025, Hoellmer et al., 4 Feb 2025), and (Albergo et al., 2022).

1. Mathematical Definition of LSI and Theoretical Foundations

Latent Stochastic Interpolants generalize SI by defining a continuous-time stochastic process in latent space via reparameterized bridges between a fixed prior $p_0(z)$ and the encoder-aggregated posterior $\int p(x_1)\, p_\theta(z_1 \mid x_1)\, dx_1$, optimized jointly with the autoencoding components. An LSI interpolant is given as

$$z_t = \eta_t \epsilon + \kappa_t z_1 + \nu_t z_0,$$

where $z_0 \sim p_0(z_0)$ (arbitrary prior), $z_1 \sim p_\theta(z_1 \mid x_1)$ (learned encoder distribution), $\epsilon \sim \mathcal{N}(0, I)$, and $\{\kappa_t, \nu_t, \eta_t\}$ are scalar or vector-valued schedules parameterizing the bridge, typically satisfying

$$\kappa_0 = 0,\quad \kappa_1 = 1,\quad \nu_0 = 1,\quad \nu_1 = 0,\quad \eta_0 = \eta_1 = 0,\quad \eta_t > 0 \ \text{for}\ t \in (0,1).$$

A canonical choice is the Brownian bridge: $\kappa_t = t$, $\nu_t = 1 - t$, $\eta_t = \sigma\sqrt{t(1-t)}$.
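The construction above is straightforward to implement. The sketch below is a minimal, illustrative PyTorch version assuming the Brownian-bridge schedule; the function names and the stand-in encoder sample are hypothetical placeholders, not the reference implementation.

```python
import torch

def brownian_bridge_schedule(t, sigma=1.0):
    # kappa_t = t, nu_t = 1 - t, eta_t = sigma * sqrt(t * (1 - t))
    return t, 1.0 - t, sigma * torch.sqrt(t * (1.0 - t))

def latent_interpolant(z0, z1, t, sigma=1.0):
    # z_t = eta_t * eps + kappa_t * z1 + nu_t * z0, with eps ~ N(0, I)
    kappa, nu, eta = brownian_bridge_schedule(t, sigma)
    eps = torch.randn_like(z1)
    return eta * eps + kappa * z1 + nu * z0

# Toy usage: z0 from the prior, z1 standing in for a sample from p_theta(z1 | x1).
z0 = torch.randn(8, 64)
z1 = torch.randn(8, 64)
t = torch.full((8, 1), 0.3)      # per-example time, broadcast over latent dimensions
zt = latent_interpolant(z0, z1, t)
```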

The time-marginal law of $z_t$ interpolates between $p_0$ at $t=0$ and the aggregated posterior at $t=1$ exactly in finite time. This construction enables LSI to model arbitrary transformations between latent distributions, with sampling and transport realized via an SDE or its probability-flow ODE counterpart:

$$d z_t = h_\theta(z_t, t)\, dt + \sigma(t)\, d w_t, \qquad z_0 \sim p_0(z_0).$$

The drift $h_\theta$ is learned to match the reference bridge determined by the interpolant.
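Sampling from the model thus amounts to drawing $z_0$ from the prior, integrating the latent SDE forward to $t=1$, and decoding. A minimal Euler–Maruyama sketch is shown below, assuming `drift` implements the learned $h_\theta$ and `decoder` maps $z_1$ back to observation space; both are hypothetical stand-ins rather than the published implementation.

```python
import torch

@torch.no_grad()
def sample_lsi_sde(drift, decoder, prior_sample, sigma=lambda t: 0.1, n_steps=100):
    """Euler-Maruyama integration of dz = h_theta(z, t) dt + sigma(t) dW from t=0 to t=1."""
    z = prior_sample                                  # z_0 ~ p_0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((z.shape[0], 1), k * dt)
        noise = torch.randn_like(z) * (dt ** 0.5)
        z = z + drift(z, t) * dt + sigma(k * dt) * noise
    return decoder(z)                                 # map z_1 back to observation space
```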

The theoretical core is the reduction of learning a complex generative map to the problem of matching the SDE's marginal at $t=1$ to the decoder's latent aggregated posterior, with tightness of the continuous-time evidence lower bound (ELBO) when the variational drift equals the model drift (Singh et al., 2 Jun 2025, Albergo et al., 2023).

2. Variational Objective and Algorithmic Structure

The LSI variational objective is an ELBO derived explicitly in continuous time. Denoting $\tilde{z}_t$ as the path under the model SDE and $z_t$ as the bridge interpolation (variational posterior), the ELBO is

$$\ln p_\theta(x_1) \geq \mathbb{E}_{q(z_t)} \left[ \ln p_\theta(x_1 \mid z_1) - \frac{1}{2}\int_0^1 \| u(z_t, t) \|^2\, dt \right],$$

where $u$ is the optimal control,

$$\sigma(z,t)\, u(z,t) = h_\phi(z,t) - h_\theta(z,t),$$

with $h_\phi$ specifying the drift for the closed-form sample path defined by the bridge $z_t$.
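For orientation, in the deterministic (probability-flow) limit the reference drift reduces to the standard stochastic interpolant velocity (Albergo et al., 2023): differentiating the bridge in time gives the per-sample velocity

$$\dot{z}_t = \dot{\kappa}_t\, z_1 + \dot{\nu}_t\, z_0 + \dot{\eta}_t\, \epsilon,$$

and the marginal velocity field targeted by regression is its conditional expectation $b(z, t) = \mathbb{E}[\dot{\kappa}_t z_1 + \dot{\nu}_t z_0 + \dot{\eta}_t \epsilon \mid z_t = z]$. This identity is included only as a reasoning aid; the exact LSI drift parameterization follows [(Singh et al., 2 Jun 2025), eq. 19].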

In practice, LSI is trained in a single-stage, end-to-end manner (a minimal training-step sketch follows this list):

  • Sample a mini-batch $\{x_1^{(i)}\}$, encode to stochastic $z_1^{(i)} \sim p_\theta(z_1 \mid x_1)$, sample $z_0^{(i)} \sim p_0$ and noise $\epsilon \sim \mathcal{N}(0, I)$, and time $t \sim U[0,1]$.
  • Compute $z_t$ via the interpolant.
  • Evaluate the ELBO terms (reconstruction and drift loss).
  • Backpropagate jointly through the encoder, decoder, and $h_\theta$.
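The following PyTorch-style sketch illustrates one such training step. It is a simplified, assumption-laden illustration: `encoder`, `decoder`, `drift`, the Gaussian reconstruction loss, and the scalar weight `beta` are hypothetical placeholders, and the drift target shown is the simple flow-matching form rather than the exact parameterization of [(Singh et al., 2 Jun 2025), eq. 19].

```python
import torch
import torch.nn.functional as F

def lsi_training_step(encoder, decoder, drift, x1, optimizer, sigma=1.0, beta=1e-4):
    """One end-to-end LSI step: encode, interpolate, and regress the latent drift."""
    # Stochastic encoding: z1 ~ p_theta(z1 | x1) via the reparameterization trick.
    mu, log_var = encoder(x1)
    z1 = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    # Prior sample, interpolation noise, and uniform time.
    z0 = torch.randn_like(z1)                  # arbitrary prior p_0 (Gaussian here)
    eps = torch.randn_like(z1)
    t = torch.rand(z1.shape[0], 1)

    # Brownian-bridge interpolant z_t and its (assumed) per-sample velocity target.
    kappa, nu, eta = t, 1.0 - t, sigma * torch.sqrt(t * (1.0 - t))
    zt = eta * eps + kappa * z1 + nu * z0
    d_eta = sigma * (1.0 - 2.0 * t) / (2.0 * torch.sqrt(t * (1.0 - t)) + 1e-8)
    target = z1 - z0 + d_eta * eps             # time derivative of the interpolant

    # ELBO-style terms: reconstruction plus weighted drift-matching loss.
    recon = F.mse_loss(decoder(z1), x1)
    drift_loss = F.mse_loss(drift(zt, t), target)
    loss = recon + beta * drift_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```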

A numerically robust variant is the $\mathrm{flow}$ parameterization, where the loss takes the form

$$\mathcal{L}(\theta) = \mathbb{E}\left[ -\ln p_\theta(x_1 \mid z_1) + (\beta_t/2)\, \| \cdots \|^2 \right],$$

with details in [(Singh et al., 2 Jun 2025), eq. 19] and schedule hyperparameters (e.g. $\beta_t$) optimized per task.

3. Numerical Solvers and Convergence Analysis

Sampling via LSI typically integrates the learned ODE/SDE in latent space. Accurate transport is guaranteed to first or second order in the time step for popular integrators. The total-variation (TV) distance between the true and discretized marginals can be upper-bounded as

$$\mathrm{TV}\left( \rho(t_N), \hat\rho(t_N) \right) \lesssim \cdots + \sum_{k=0}^{N-1} h_k^{2/3}\,[\cdots],$$

with first-order (Euler, $O(1/N)$) and second-order (Heun/RK2, $O(1/N^2)$) error scaling, provided $\gamma(t) > 0$ is handled or early stopping avoids the endpoints where $\gamma(t) \to 0$ (Liu et al., 10 Aug 2025).

The interpolant noise scale $\gamma(t)$ determines stiffness and grid selection; adaptive schedules where $h_k \propto \gamma(t_k)^2$ substantially improve efficiency, as can employing second-order methods with only two drift evaluations per step.
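A hedged sketch of such a second-order (Heun) integrator on a nonuniform grid is shown below. The grid construction $h_k \propto \gamma(t_k)^2$ follows the heuristic above, endpoint avoidance implements the early stopping, and `drift` is again a hypothetical stand-in for the learned $h_\theta$ in probability-flow (ODE) mode.

```python
import torch

@torch.no_grad()
def heun_sample(drift, z0, gamma, n_steps=50, t_min=1e-3, t_max=1.0 - 1e-3):
    """Heun (RK2) integration of the latent probability-flow ODE on an adaptive grid.

    Step sizes are proportional to gamma(t)^2, and the endpoints are avoided
    (early stopping) where gamma(t) -> 0.
    """
    # Build a nonuniform grid with h_k proportional to gamma(t_k)^2.
    ref = torch.linspace(t_min, t_max, n_steps + 1)
    weights = gamma(ref[:-1]) ** 2
    h = (t_max - t_min) * weights / weights.sum()
    ts = torch.cat([torch.tensor([t_min]), t_min + torch.cumsum(h, dim=0)])

    z = z0
    for k in range(n_steps):
        t0, t1 = ts[k], ts[k + 1]
        hk = t1 - t0
        tb0 = torch.full((z.shape[0], 1), float(t0))
        tb1 = torch.full((z.shape[0], 1), float(t1))
        d0 = drift(z, tb0)                     # first drift evaluation
        z_pred = z + hk * d0                   # Euler predictor
        d1 = drift(z_pred, tb1)                # second drift evaluation
        z = z + 0.5 * hk * (d0 + d1)           # Heun corrector
    return z
```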

4. Relation to Standard Diffusion Models and Other Generative Frameworks

LSI encompasses and generalizes standard score-based diffusion and flow-matching models:

  • LSI allows an arbitrary prior $p_0(z_0)$ (Gaussian, uniform, Laplace, or learned), whereas diffusion models conventionally fix $p_0 = \mathcal{N}(0, I)$ (see the sketch after this list).
  • LSI runs the generative dynamics in a low-dimensional latent domain, sharply reducing computational cost relative to SI or diffusion run directly in pixel space [(Singh et al., 2 Jun 2025), Table 1, Table 7].
  • Both ODE-based (flow-matching) and SDE-based (score-based) LSI variants are supported by the same quadratic regression/ELBO objective, unifying flows and diffusions and connecting with Schrödinger bridge approaches (Albergo et al., 2023, Albergo et al., 2022).
  • For multimarginal or multitask settings, operator- or simplex-based generalizations of the SI time index enable one-shot, multi-way LSI frameworks, supporting inpainting, style transfer, and conditional sampling without retraining (Albergo et al., 2023, Negrel et al., 6 Aug 2025).
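As an illustration of the first point, swapping the prior only changes how $z_0$ is drawn; the drift network and integrator are untouched. A minimal sketch, assuming the hypothetical `heun_sample`, `drift`, and `gamma` from the previous sections:

```python
import torch

B, D = 16, 64                      # batch size and latent dimension (illustrative)

priors = {
    "gaussian": lambda: torch.randn(B, D),
    "uniform":  lambda: torch.rand(B, D) * 2.0 - 1.0,                    # U[-1, 1]
    "laplace":  lambda: torch.distributions.Laplace(0.0, 1.0).sample((B, D)),
}

# Only the initial draw differs; drift and integrator are shared across priors:
# samples = {name: heun_sample(drift, p(), gamma) for name, p in priors.items()}
```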

5. Model Architectures, Parameterization, and Training Details

Empirically, in large-scale image modeling (ImageNet), the LSI pipeline comprises:

  • Encoder/decoder: 2-stage convolutional stack, with up/downsampling and linear token layers.
  • Latent SI (drift) network: U-Net-like stack of self-attention Transformer blocks at latent spatial resolution (e.g., $16 \times 16$).
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.99$, $\epsilon = 10^{-12}$), batch size 2048, up to 2000 epochs, learning rate warmup, cosine decay, exponential moving average with 0.9999 decay (see the configuration sketch after this list).
  • Time parameterization: $t \sim U[0,1]$, possibly warped as $t(s) = 1 - (1-s)^c$ with $c = 1$.
  • Noise scale $\beta$ scanned (e.g. $\beta \in [10^{-6}, 5\times 10^{-3}]$), with best performance for $\beta \approx 10^{-4}$ at $128^2$ resolution [(Singh et al., 2 Jun 2025), Appendix I].
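A hedged configuration sketch for the optimization details above is given below. The learning rate, warmup length, and total step count are hypothetical placeholders; only the AdamW betas/epsilon and the EMA decay follow the reported values.

```python
import math
import torch

def build_lsi_optimizer(model, base_lr=1e-4, warmup_steps=10_000, total_steps=500_000):
    """AdamW with the reported betas/eps, plus linear warmup and cosine decay.

    base_lr, warmup_steps, and total_steps are illustrative placeholders.
    """
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            betas=(0.9, 0.99), eps=1e-12)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                      # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))           # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    """Exponential moving average of weights with the reported 0.9999 decay."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```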

Latent interpolation schedules ($\kappa_t$, $\nu_t$, $\eta_t$) are generally chosen as linear or Brownian-bridge-like, with stochastic encoding providing regularization and improved generative diversity.

6. Empirical Evaluation and Benchmarks

LSI demonstrates competitive or superior performance to observation-space SI and pixel-space diffusion on standard generative metrics, using order-of-magnitude fewer sampling FLOPs. On ImageNet:

  • FID for LSI ("Latent") at $128^2$ is 3.12, outperforming SI in observation space (3.46) with reduced computational budget [(Singh et al., 2 Jun 2025), Table 1].
  • Arbitrary priors (uniform, Laplace, Gaussian) deliver similar performance (e.g., FID 4.81, 4.45, 3.76) [(Singh et al., 2 Jun 2025), Table 7].
  • Classifier-free guidance and inversion via SDE/ODE sampling are effective and seamlessly supported due to the flexible latent dynamics [(Singh et al., 2 Jun 2025), Figures 6 and 7].
  • Physical emulation with LSI matches or improves upon FNO, DDPM, and FM baselines on PDE and climate tasks while requiring only 2–10 integration steps for deterministic predictions, or more for calibrated ensemble spread (Zhou et al., 30 Sep 2025).

7. Extensions, Limitations, and Future Directions

LSI is extensible along several directions, summarized under the open research directions at the end of this section.

Limitations noted include:

  • The variational posterior $q(z_t \mid z_0, z_1)$ is often assumed to follow linear bridges with fixed $\sigma_t$, potentially restricting expressiveness relative to general diffusion bridges.
  • Numerical issues arise near $t = 0, 1$ due to $\gamma(t)$ vanishing (endpoint singularity).
  • Quantitative performance is coupled to the expressive power of the autoencoder/decoder and the capacity of the drift network.
  • Theoretical guarantees assume correctly specified conditional expectations and sufficiently expressive function approximators.

Open research directions include learning more flexible ($t$-dependent, nonlinear) variational bridges, extension to other modalities (audio, video), tighter ELBO variants, explicit connections with Schrödinger bridge or Wasserstein gradient flows in learned latent geometries, and adaptive numerical integration strategies to further optimize sampling tradeoffs (Singh et al., 2 Jun 2025).

