Latent Stochastic Interpolants
- LSI is a framework that extends stochastic interpolants into latent space, enabling efficient generative modeling with continuous-time transport between arbitrary priors and data-driven posteriors.
- It jointly optimizes the encoder, decoder, and latent dynamics using a unified SDE/ODE formulation, reducing computational cost and enhancing training efficiency.
- LSI achieves competitive performance in generative tasks by lowering sampling FLOPs and supporting flexible latent representations, as validated by empirical and numerical analyses.
Latent Stochastic Interpolants (LSI) are a theoretical and algorithmic extension of the stochastic interpolant (SI) framework, transferring SI’s continuous-time transport between probability measures from data or observation space into a learned, low-dimensional latent space. This facilitates highly flexible generative modeling, supports arbitrary latent priors, and enables efficient sampling and training schemes using a unified SDE/ODE formalism. LSI jointly optimizes the encoder, decoder, and latent-path dynamics, allowing end-to-end learning of both latent representations and the generative process bridging an arbitrary prior to the data-driven aggregated posterior. This section details the mathematical formulation, algorithmic structure, sample complexity, numerical schemes, and salient properties of the LSI framework, referencing definitions and empirical results as established across the literature, principally in (Singh et al., 2 Jun 2025, Albergo et al., 2023, Liu et al., 10 Aug 2025, Hoellmer et al., 4 Feb 2025), and (Albergo et al., 2022).
1. Mathematical Definition of LSI and Theoretical Foundations
Latent Stochastic Interpolants generalize SI by defining a continuous-time stochastic process in latent space via reparameterized bridges between a fixed prior $p_0$ and the encoder-aggregated posterior $q_\phi(z) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[q_\phi(z \mid x)\right]$, optimized jointly with the autoencoding components. An LSI interpolant is given as
$$
z_t = \alpha_t\, z_0 + \beta_t\, z_1 + \gamma_t\, \epsilon, \qquad t \in [0, 1],
$$
where $z_0 \sim p_0$ (arbitrary prior), $z_1 \sim q_\phi(z \mid x)$ (learned encoder distribution), $\epsilon \sim \mathcal{N}(0, I)$, and $\alpha_t, \beta_t, \gamma_t$ are scalar- or vector-valued schedules parameterizing the bridge, typically satisfying
$$
\alpha_0 = 1,\quad \alpha_1 = 0, \qquad \beta_0 = 0,\quad \beta_1 = 1, \qquad \gamma_0 = \gamma_1 = 0.
$$
A canonical choice is the Brownian bridge: $\gamma_t = \sigma\sqrt{t(1-t)}$.
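A minimal sketch of this construction in Python/PyTorch, assuming linear $\alpha_t, \beta_t$ and the Brownian-bridge noise schedule above; the function names and tensor shapes are illustrative, not the reference implementation:

```python
import torch

def schedules(t, sigma=1.0):
    """Linear alpha/beta with a Brownian-bridge noise scale.

    t: tensor of shape (batch,) with values in [0, 1].
    Boundary conditions: alpha_0=1, alpha_1=0, beta_0=0, beta_1=1, gamma_0=gamma_1=0.
    """
    alpha = 1.0 - t
    beta = t
    gamma = sigma * torch.sqrt(t * (1.0 - t))
    return alpha, beta, gamma

def interpolant(z0, z1, eps, t, sigma=1.0):
    """z_t = alpha_t * z0 + beta_t * z1 + gamma_t * eps, broadcast over latent dims."""
    alpha, beta, gamma = schedules(t, sigma)
    view = (-1,) + (1,) * (z0.dim() - 1)   # reshape (batch,) -> (batch, 1, ...) for broadcasting
    return alpha.view(view) * z0 + beta.view(view) * z1 + gamma.view(view) * eps

# Example: prior sample z0, encoder sample z1 (stand-in), Gaussian noise eps
z0 = torch.randn(8, 16)   # z0 ~ p0 (here a standard Gaussian prior)
z1 = torch.randn(8, 16)   # z1 ~ q_phi(z | x) (stand-in for encoder samples)
eps = torch.randn_like(z0)
t = torch.rand(8)         # t ~ U[0, 1]
zt = interpolant(z0, z1, eps, t)
```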
The time-marginal law $p_t$ of $z_t$ interpolates exactly in finite time between the prior $p_0$ at $t = 0$ and the aggregated posterior $q_\phi$ at $t = 1$. This construction enables LSI to model arbitrary transformations between latent distributions, with sampling and transport realized via an SDE or its probability-flow ODE counterpart:
$$
\mathrm{d}Z_t = b_\theta(Z_t, t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t,
\qquad
\frac{\mathrm{d}z_t}{\mathrm{d}t} = b_\theta(z_t, t) - \tfrac{1}{2}\sigma_t^2\, \nabla_z \log p_t(z_t),
$$
which share the same time marginals. The drift $b_\theta$ is learned to match the reference bridge determined by the interpolant.
The theoretical core is the reduction of learning a complex generative map to the problem of matching the SDE’s marginal at $t = 1$ to the decoder’s latent aggregated posterior, with the continuous-time evidence lower bound (ELBO) becoming tight when the variational drift equals the model drift (Singh et al., 2 Jun 2025, Albergo et al., 2023).
2. Variational Objective and Algorithmic Structure
The LSI variational objective is an ELBO derived explicitly in continuous time. Denoting by $\mathbb{Q}$ the path measure of the model SDE and by $\mathbb{P}$ that of the bridge interpolation (variational posterior), the ELBO takes the schematic form
$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{\mathbb{P}}\!\left[\log p_\theta(x \mid Z_1)\right]
\;-\; \mathbb{E}_{\mathbb{P}}\!\left[\int_0^1 \frac{1}{2\sigma_t^2}\,\big\| b_\theta(Z_t, t) - u_t(Z_t \mid z_0, z_1)\big\|^2 \,\mathrm{d}t\right],
$$
where $u$ is the optimal control, i.e. the conditional velocity
$$
u_t(z \mid z_0, z_1) = \dot\alpha_t\, z_0 + \dot\beta_t\, z_1 + \dot\gamma_t\, \epsilon,
$$
with $u$ specifying the drift of the closed-form sample path $z_t = \alpha_t z_0 + \beta_t z_1 + \gamma_t \epsilon$ defined by the bridge.
In practice, LSI is trained in a single-stage, end-to-end manner:
- Sample a mini-batch $x \sim p_{\mathrm{data}}$, encode to a stochastic latent $z_1 \sim q_\phi(z \mid x)$, sample the prior $z_0 \sim p_0$ and noise $\epsilon \sim \mathcal{N}(0, I)$, and draw a time $t \sim \mathcal{U}[0, 1]$.
- Compute $z_t$ via the interpolant.
- Evaluate ELBO terms (reconstruction and drift loss).
- Backpropagate jointly through the encoder, decoder, and drift network $b_\theta$.
A numerically robust variant uses an alternative parameterization of the drift, for which the loss takes the form of a time-weighted quadratic regression onto the interpolant targets, with details in [(Singh et al., 2 Jun 2025), eq. 19] and schedule hyperparameters optimized per task.
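The single-stage procedure above can be sketched as follows, assuming a Gaussian encoder, an MSE reconstruction term, and an unweighted drift-matching loss on the conditional interpolant velocity; the module definitions and loss weighting are simplified stand-ins, not the parameterization of [(Singh et al., 2 Jun 2025), eq. 19]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, data_dim = 16, 64

# Illustrative stand-ins for the encoder, decoder, and drift network b_theta.
encoder = nn.Sequential(nn.Linear(data_dim, 128), nn.SiLU(), nn.Linear(128, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.SiLU(), nn.Linear(128, data_dim))
drift = nn.Sequential(nn.Linear(latent_dim + 1, 128), nn.SiLU(), nn.Linear(128, latent_dim))

params = list(encoder.parameters()) + list(decoder.parameters()) + list(drift.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

def training_step(x, sigma=1.0, t_eps=1e-3):
    n = x.shape[0]
    # 1) Stochastic encoding z1 ~ q_phi(z | x) via the reparameterization trick.
    mu, log_var = encoder(x).chunk(2, dim=-1)
    z1 = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # 2) Prior sample z0 ~ p0, noise eps, and time t clamped away from the endpoints.
    z0 = torch.randn(n, latent_dim)
    eps = torch.randn_like(z0)
    t = torch.rand(n).clamp(t_eps, 1.0 - t_eps)
    # 3) Interpolant z_t and its conditional velocity (the drift regression target).
    alpha, beta = 1.0 - t, t
    gamma = sigma * torch.sqrt(t * (1.0 - t))
    dgamma = sigma * (1.0 - 2.0 * t) / (2.0 * torch.sqrt(t * (1.0 - t)))
    a, b, g, dg, tt = [s.unsqueeze(-1) for s in (alpha, beta, gamma, dgamma, t)]
    zt = a * z0 + b * z1 + g * eps
    target = -z0 + z1 + dg * eps          # d/dt of alpha_t z0 + beta_t z1 + gamma_t eps
    # 4) ELBO terms: reconstruction from z1 plus the drift-matching loss.
    pred = drift(torch.cat([zt, tt], dim=-1))
    loss = F.mse_loss(decoder(z1), x) + F.mse_loss(pred, target)
    # 5) Joint backprop through encoder, decoder, and drift network.
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = training_step(torch.randn(32, data_dim))
```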
3. Numerical Solvers and Convergence Analysis
Sampling via LSI typically integrates the learned ODE/SDE in latent space. Accurate transport is guaranteed to first or second order in the time step for popular integrators. The TV distance between the true and discretized path laws can be upper-bounded as
$$
\mathrm{TV}\big(p_1, \hat{p}_1\big) \;\lesssim\; C\, h^{p},
$$
with first-order (Euler, $p = 1$) and second-order (Heun/RK2, $p = 2$) error scaling in the step size $h$, provided the endpoint singularity is handled or early stopping avoids the endpoints where $\gamma_t \to 0$ (Liu et al., 10 Aug 2025).
The interpolant noise scale $\gamma_t$ determines stiffness and grid selection; adaptive step-size schedules substantially improve efficiency, as can employing second-order methods with only two drift evaluations per step.
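A minimal fixed-grid sampler sketch for the probability-flow ODE, with Euler and Heun (RK2) options and early stopping just inside the endpoints; the drift interface matches the training sketch above and the step count is an illustrative choice:

```python
import torch

@torch.no_grad()
def sample_latents(drift, n, latent_dim, steps=50, method="heun", t0=1e-3, t1=1.0 - 1e-3):
    """Integrate dz/dt = b_theta(z, t) from the prior at t ~ 0 toward the posterior at t ~ 1.

    Early stopping at t0 / t1 avoids the endpoints where gamma_t -> 0.
    """
    z = torch.randn(n, latent_dim)              # z ~ p0
    ts = torch.linspace(t0, t1, steps + 1)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        h = t_next - t
        b1 = drift(torch.cat([z, t.expand(n, 1)], dim=-1))
        if method == "euler":                   # first-order update
            z = z + h * b1
        else:                                   # Heun / RK2: two drift evaluations per step
            b2 = drift(torch.cat([z + h * b1, t_next.expand(n, 1)], dim=-1))
            z = z + 0.5 * h * (b1 + b2)
    return z

# Usage with the `drift` and `decoder` modules from the training sketch:
# z1 = sample_latents(drift, n=8, latent_dim=16)
# x_hat = decoder(z1)
```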
4. Relation to Standard Diffusion Models and Other Generative Frameworks
LSI encompasses and generalizes standard score-based diffusion and flow-matching models:
- LSI allows an arbitrary prior $p_0$ (Gaussian, uniform, Laplace, or learned), whereas diffusion models conventionally fix $p_0 = \mathcal{N}(0, I)$ (see the prior-sampling sketch after this list).
- LSI runs the generative dynamics in a low-dimensional latent domain, sharply reducing computational cost relative to SI or diffusion run directly in pixel space [(Singh et al., 2 Jun 2025) Table 1, Table 7].
- Both ODE-based (flow-matching) and SDE-based (score-based) LSI variants are supported by the same quadratic regression/ELBO objective, unifying flows and diffusions and connecting with Schrödinger bridge approaches (Albergo et al., 2023, Albergo et al., 2022).
- For multimarginal or multitask settings, operator- or simplex-based generalizations of the SI time index enable one-shot, multi-way LSI frameworks, supporting inpainting, style transfer, and conditional sampling without retraining (Albergo et al., 2023, Negrel et al., 6 Aug 2025).
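Because the construction only requires the ability to draw samples $z_0 \sim p_0$, swapping the prior is a one-line change in the sketches above; a hedged illustration mirroring the priors compared here (the unit-variance scalings are illustrative choices, not those of the source):

```python
import math
import torch

def sample_prior(n, latent_dim, kind="gaussian"):
    """Draw z0 ~ p0 for a few interchangeable priors (scaled to roughly unit variance)."""
    if kind == "gaussian":
        return torch.randn(n, latent_dim)
    if kind == "uniform":
        # U[-sqrt(3), sqrt(3)] has unit variance.
        return (torch.rand(n, latent_dim) * 2.0 - 1.0) * math.sqrt(3.0)
    if kind == "laplace":
        # Laplace(0, 1/sqrt(2)) has unit variance.
        return torch.distributions.Laplace(0.0, 1.0 / math.sqrt(2.0)).sample((n, latent_dim))
    raise ValueError(f"unknown prior: {kind}")

z0 = sample_prior(8, 16, kind="laplace")
```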
5. Model Architectures, Parameterization, and Training Details
Empirically, in large-scale image modeling (ImageNet), the LSI pipeline comprises:
- Encoder/decoder: 2-stage convolutional stack, with up/downsampling and linear token layers.
- Latent SI (drift) network: U-Net-like stack of self-attention Transformer blocks operating at the latent spatial resolution (a simplified sketch appears at the end of this section).
- Optimizer: AdamW, batch size 2048, up to 2000 epochs, learning-rate warmup, cosine decay, exponential moving average with 0.9999 decay.
- Time parameterization: $t$ sampled uniformly on $[0, 1]$, possibly warped by a tunable monotone schedule.
- Noise scale $\sigma$ scanned over several values, with the best-performing setting depending on resolution [(Singh et al., 2 Jun 2025), Appendix I].
Latent interpolation schedules ($\alpha_t, \beta_t, \gamma_t$) are generally chosen as linear or Brownian-bridge-like, with stochastic encoding providing regularization and improved generative diversity.
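A compact, time-conditioned Transformer drift network as a simplified stand-in for the U-Net-like Transformer stack described above; depth, width, token count, and the additive time conditioning are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DriftTransformer(nn.Module):
    """Time-conditioned Transformer over latent tokens predicting the drift b_theta(z_t, t)."""

    def __init__(self, dim=64, tokens=16, depth=4, heads=4):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.pos = nn.Parameter(torch.zeros(1, tokens, dim))       # learned positional embedding
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, z_t, t):
        # z_t: (batch, tokens, dim) latent tokens; t: (batch,) times in [0, 1].
        temb = self.time_mlp(t.unsqueeze(-1)).unsqueeze(1)          # (batch, 1, dim)
        h = z_t + self.pos + temb                                   # add positional + time conditioning
        return self.out(self.blocks(h))

net = DriftTransformer()
b = net(torch.randn(2, 16, 64), torch.rand(2))   # drift estimate with the same shape as z_t
```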
6. Empirical Evaluation and Benchmarks
LSI demonstrates competitive or superior performance relative to observation-space SI and pixel-space diffusion on standard generative metrics, while using an order of magnitude fewer sampling FLOPs. On ImageNet:
- FID for LSI ("Latent") is 3.12, outperforming SI in observation space (3.46) at a reduced computational budget [(Singh et al., 2 Jun 2025), Table 1].
- Arbitrary priors (uniform, Laplace, Gaussian) deliver similar performance (e.g., FID 4.81, 4.45, 3.76) [(Singh et al., 2 Jun 2025), Table 7].
- Classifier-free guidance and inversion via SDE/ODE sampling are effective and seamlessly supported due to the flexible latent dynamics [(Singh et al., 2 Jun 2025), Figures 6 and 7].
- Physical emulation with LSI matches or improves upon FNO, DDPM, and FM baselines on PDE and climate tasks while requiring only $2$–$10$ integration steps for deterministic predictions, or more steps when calibrated ensemble spread is required (Zhou et al., 30 Sep 2025).
7. Extensions, Limitations, and Future Directions
LSI is extensible to:
- Arbitrary prior/posterior pairs, discrete and periodic-valued spaces (e.g., with atomistic flows in crystalline materials (Hoellmer et al., 4 Feb 2025)), operator-based multitask learning (Negrel et al., 6 Aug 2025), and multimarginal translation (Albergo et al., 2023).
- SDE or ODE solvers, with practical guidance to handle endpoint singularities or hyperparameter sensitivity.
- Multimodal, hierarchical, or domain-conditional latent representations (Singh et al., 2 Jun 2025).
Limitations noted include:
- The variational posterior is often assumed to follow linear bridges with a fixed noise schedule, potentially restricting expressiveness relative to general diffusion bridges.
- Numerical issues near $t = 0$ and $t = 1$ due to vanishing $\gamma_t$ (endpoint singularity).
- Quantitative performance is coupled to the expressive power of the autoencoder/decoder and the capacity of the drift network.
- Theoretical guarantees assume correctly specified conditional expectations and sufficiently expressive function approximators.
Open research directions include learning more flexible, nonlinear variational bridges, extension to other modalities (audio, video), tighter ELBO variants, explicit connections with Schrödinger bridge or Wasserstein gradient flows in learned latent geometries, and adaptive numerical integration strategies to further optimize sampling tradeoffs (Singh et al., 2 Jun 2025).
References:
- "Latent Stochastic Interpolants" (Singh et al., 2 Jun 2025)
- "Stochastic Interpolants: A Unifying Framework for Flows and Diffusions" (Albergo et al., 2023)
- "Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants" (Liu et al., 10 Aug 2025)
- "Open Materials Generation with Stochastic Interpolants" (Hoellmer et al., 4 Feb 2025)
- "Building Normalizing Flows with Stochastic Interpolants" (Albergo et al., 2022)
- "Multimarginal generative modeling with stochastic interpolants" (Albergo et al., 2023)
- "Multitask Learning with Stochastic Interpolants" (Negrel et al., 6 Aug 2025)
- "Reframing Generative Models for Physical Systems using Stochastic Interpolants" (Zhou et al., 30 Sep 2025)