Latent Schrödinger Bridge Models

Updated 3 May 2026

Latent Schrödinger Bridge Models are generative frameworks that transform samples between probability distributions via entropic optimal transport in a learned latent space.
They combine encoder–decoder architectures with stochastic differential equations and neural solvers such as score-based diffusion and neural ODEs for efficient, high-dimensional modeling.
Empirical results demonstrate state-of-the-art performance in tasks like 3D shape completion and image synthesis with significant improvements in speed and parameter efficiency.

Latent Schrödinger Bridge Models are a class of generative modeling and optimal transport frameworks that formulate the problem of transforming samples from one probability distribution to another as an entropic optimal transport problem, solved in a learned low-dimensional latent space. The core principle is to explicitly model the globally optimal stochastic dynamics (the “bridge”) that couple the source and target distributions by minimizing the Kullback–Leibler divergence to a reference stochastic process, typically a Brownian motion or reference diffusion. By learning these dynamics in the latent space of a neural encoder–decoder, these models gain computational tractability, improved sample quality, and rigorous theoretical guarantees for high-dimensional data. Recent architectures unify Schrödinger bridge theory, deep score-based diffusion, and variational latent compression, with applications including 3D shape completion, image synthesis, and latent-space optimal transport (Kong et al., 29 Jun 2025, Jiao et al., 2024, Khilchuk et al., 14 Dec 2025).

1. The Schrödinger Bridge Formulation in Latent Space

The dynamic Schrödinger bridge problem seeks a stochastic process $(z_t)_{t\in[0,T]}$ whose endpoints marginally realize two prescribed distributions $\pi_0$ and $\pi_1$ (e.g., corresponding to complete and incomplete data), while remaining minimal in relative entropy to a reference process, typically a diffusion. In latent space, this is formalized as: $\mathbb{P}^* = \arg\min_{\mathbb{P} : \mathbb{P}_{t=0}=\pi_0,\,\mathbb{P}_{t=T}=\pi_1} \mathrm{KL}(\mathbb{P}\|\mathbb{Q}),$ where $\mathbb{Q}$ is the law of a reference SDE, such as

$dz_t = f(z_t,t)\,dt + g(t)\,dW_t$

with drift $f$ and diffusion $g$. The solution $\mathbb{P}^*$ induces a forward SDE of the form

$dz_t = \bigl[ f(z_t,t) + g^2(t)\nabla \log\Psi_t(z_t) \bigr]dt + g(t)\,dW_t$

along with a coupled backward SDE. The functions $\Psi_t$ and $\hat{\Psi}_t$ solve Schrödinger-type PDEs with endpoint constraints $\Psi_0\hat{\Psi}_0 = \pi_0$ and $\Psi_T\hat{\Psi}_T = \pi_1$ (Kong et al., 29 Jun 2025, Jiao et al., 2024).
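As an illustration of how the extra drift term $g^2\nabla\log\Psi_t$ steers the reference process, the following minimal sketch simulates the forward SDE with Euler–Maruyama in one dimension. It makes strong simplifying assumptions that are not from the cited papers: zero reference drift ($f = 0$), constant diffusion $g = \sigma$, and a degenerate target concentrated at a single point `z_target`. In that special case $\Psi_t(z)$ is a Gaussian density in closed form and the correction reduces to $(z_{\text{target}} - z)/(T - t)$. The function name and parameters are hypothetical.

```python
import numpy as np

def simulate_pinned_bridge(z0, z_target, sigma=1.0, T=1.0,
                           n_steps=200, n_paths=2000, seed=0):
    """Euler-Maruyama for dz = g^2 * grad log Psi_t(z) dt + g dW, with f = 0, g = sigma.

    Assumption: the target is a point mass at z_target, so
    Psi_t(z) = N(z_target; z, sigma^2 (T - t)) and
    g^2 * grad log Psi_t(z) = (z_target - z) / (T - t).
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = np.full(n_paths, float(z0))
    for k in range(n_steps):
        t = k * dt
        drift = (z_target - z) / (T - t)   # bridge correction term
        noise = rng.standard_normal(n_paths)
        z = z + drift * dt + sigma * np.sqrt(dt) * noise
    return z

z_end = simulate_pinned_bridge(z0=-1.0, z_target=2.0)
print(z_end.mean(), z_end.std())
```

All simulated paths end close to `z_target`, with a residual spread of order $\sigma\sqrt{\Delta t}$ from the final discretization step. For general endpoint distributions $\Psi_t$ has no closed form, which is why it is learned.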

This construction is equivalent to entropic optimal transport, regularizing the classical Monge–Kantorovich problem by penalizing deviations from a stochastic reference path via the path-space KL divergence.
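The static counterpart of this equivalence can be sketched with Sinkhorn iterations on two small discrete distributions. This is a generic entropic optimal transport solver, not the method of any cited paper. The support points, weights, and the value of `eps` are illustrative assumptions. For a Brownian reference with constant diffusion $\sigma$ over horizon $T$, the matching static problem uses the cost $\tfrac{1}{2}\|x-y\|^2$ with regularization $\varepsilon = \sigma^2 T$.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.5, n_iters=500):
    """Entropic OT: returns the coupling minimizing <plan, cost> + eps * KL(plan || a x b)."""
    K = np.exp(-cost / eps)          # Gibbs kernel induced by the reference process
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match marginal a
        v = b / (K.T @ u)            # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]

# Illustrative 1D supports and uniform weights (assumptions for the example)
x = np.linspace(-1.0, 1.0, 5)
y = np.linspace(0.0, 2.0, 6)
a = np.full(5, 1.0 / 5)
b = np.full(6, 1.0 / 6)
cost = 0.5 * (x[:, None] - y[None, :]) ** 2

plan = sinkhorn(a, b, cost, eps=0.5)
print(plan.sum(axis=1), plan.sum(axis=0))
```

Smaller `eps` concentrates the plan toward the unregularized Monge–Kantorovich solution, while larger `eps` spreads mass toward the independent coupling.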

2. Latent Representations: Encoder–Decoder Architectures

Latent Schrödinger bridge models operate in a learned latent representation, induced by a neural autoencoder or variational autoencoder (VAE). Data $x \in \mathbb{R}^{D}$ is mapped to a lower-dimensional code $z$: $z = \mathcal{E}(x) \in \mathbb{R}^{d},\ d \ll D,$ where $D$ is the ambient dimension. In “BridgeShape,” a vector-quantized VAE (VQ-VAE) equipped with depth-enhanced features encodes high-resolution 3D shapes into a structured latent grid, maximizing geometric fidelity and compressibility (Kong et al., 29 Jun 2025). The latent space distributions $\pi_0$ (complete) and $\pi_1$ (incomplete/partial) are constructed by encoding datasets of paired data. Encoder–decoder pre-training is performed via an MSE reconstruction loss, with $\mathcal{G}$ denoting the decoder: $\mathcal{L}_{\mathrm{rec}} = \mathbb{E}_{x}\,\bigl\|x - \mathcal{G}(\mathcal{E}(x))\bigr\|_2^2.$ Theoretical results guarantee that, under compression regularity, the end-to-end reconstruction error decays as a function of pre-training dataset size and latent dimension (Jiao et al., 2024).
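The encode, decode, and reconstruct pattern can be illustrated with the simplest possible stand-in: a linear autoencoder obtained from PCA, which is the MSE-optimal linear encoder–decoder pair. This is only a sketch on synthetic data. The models discussed above use deep VAE or VQ-VAE networks, and the dimensions and variable names here are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 20, 3, 500                      # ambient dim, latent dim, sample count (illustrative)

# Synthetic data lying near a d-dimensional subspace of R^D
basis = rng.standard_normal((d, D))
x = rng.standard_normal((n, d)) @ basis + 0.01 * rng.standard_normal((n, D))

# "Pre-training": PCA gives the best linear encoder/decoder under MSE
mean = x.mean(axis=0)
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
W = vt[:d]                                # (d, D) projection onto top principal directions

def encode(x):
    return (x - mean) @ W.T               # R^D -> R^d

def decode(z):
    return z @ W + mean                   # R^d -> R^D

z = encode(x)
x_hat = decode(z)
mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
print(z.shape, mse)
```

After this step the encoder and decoder would be frozen, and all bridge training would operate on the codes `z` only.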

3. Algorithms: Neural, Symbolic, and Hybrid Solvers

Latent SB models support several algorithmic paradigms for solving the entropic transport:

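Neural Score-Based Diffusion: A neural network $\epsilon_\theta(z_t, t)$ is trained to parameterize the conditional noise in the Gaussian bridge, using score matching over paired endpoint latent codes and intermediate noisy latents (Kong et al., 29 Jun 2025). The training objective is

$\mathcal{L}(\theta) = \mathbb{E}_{(z_0, z_T),\, t,\, z_t \sim q(z_t \mid z_0, z_T)} \bigl\| \epsilon_\theta(z_t, t) - \epsilon_t \bigr\|_2^2,$

where $\epsilon_t$ denotes the noise regression target, with the latent bridge posterior $q(z_t \mid z_0, z_T)$ given in closed form.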
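To make the closed-form posterior concrete, the sketch below samples intermediate latents for the special case of a Brownian reference with zero drift and constant diffusion `sigma`. In that case, conditioning on both endpoints gives a Gaussian with linearly interpolated mean and variance $\sigma^2 t (T - t)/T$. The regression target used here (the standard normal draw that produced $z_t$) is one simple choice made for illustration and is not necessarily the target used in BridgeShape. Function names are hypothetical.

```python
import numpy as np

def sample_bridge_latent(z0, zT, t, sigma=1.0, T=1.0, rng=None):
    """Draw z_t ~ q(z_t | z_0, z_T) for a Brownian reference (f = 0, g = sigma).

    t has shape (batch, 1) so it broadcasts over latent dimensions.
    Returns the noisy latent and the standard normal noise that produced it.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = t / T
    mean = (1.0 - w) * z0 + w * zT
    std = sigma * np.sqrt(t * (T - t) / T)
    eps = rng.standard_normal(z0.shape)
    return mean + std * eps, eps

def noise_matching_loss(eps_pred, eps):
    """Mean squared error between predicted and target noise."""
    return np.mean(np.sum((eps_pred - eps) ** 2, axis=-1))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4096, 2))       # stand-in for encoded complete data
zT = z0 + 3.0                             # stand-in for encoded partial data
t_mid = np.full((4096, 1), 0.5)
zt, eps = sample_bridge_latent(z0, zT, t_mid, rng=rng)
print(noise_matching_loss(np.zeros_like(eps), eps))
```

In an actual training loop, `eps_pred` would come from the network evaluated at `(zt, t)`, and `t` would be sampled per example rather than fixed.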

Neural ODE Surrogates: The continuous-time bridge drift is parameterized as a neural ODE vector field,

$\frac{dz_t}{dt} = v_\theta(z_t, t),$

trained via iterative matching to bridge velocities derived from the SDE and endpoint interpolation strategies (Khilchuk et al., 14 Dec 2025). Both forward and backward ODEs are learned, offering superior control over sampling and computational efficiency.
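Once a vector field is available, sampling amounts to numerically integrating the ODE. The sketch below uses a classical fourth-order Runge–Kutta integrator and a hand-written `toy_field` as a hypothetical stand-in for a trained network. For simplicity it transports backward by integrating the same field in reverse time, whereas the approach described above learns separate forward and backward fields.

```python
import numpy as np

def toy_field(z, t):
    """Hypothetical stand-in for a trained vector field v_theta(z, t)."""
    return -z + np.sin(t)

def integrate(field, z, t_start, t_end, n_steps=100):
    """Classical RK4 integration of dz/dt = field(z, t) from t_start to t_end."""
    dt = (t_end - t_start) / n_steps      # negative when integrating backward
    t = t_start
    for _ in range(n_steps):
        k1 = field(z, t)
        k2 = field(z + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = field(z + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = field(z + dt * k3, t + dt)
        z = z + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += dt
    return z

z0 = np.array([[0.5, -1.0], [2.0, 0.3]])
z_T = integrate(toy_field, z0, 0.0, 1.0)      # forward transport
z_back = integrate(toy_field, z_T, 1.0, 0.0)  # reverse-time transport with the same field
print(np.abs(z_back - z0).max())
```

Because the ODE is deterministic, the round trip recovers the starting latents up to integration error, which is one practical way to check a learned transport map.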

Symbolic SINDy Flow Matching: For low-dimensional or nearly Gaussian latent spaces, the bridge dynamics can be represented by a sparse symbolic regression model,

$\frac{dz_t}{dt} = \Theta(z_t)\,\Xi,$

where $\Theta$ is a polynomial feature library and $\Xi$ is a sparse coefficient matrix fitted via $\ell_1$-regularized least squares. This reduction yields interpretable, efficient models with orders-of-magnitude fewer parameters and near-instantaneous inference (Khilchuk et al., 14 Dec 2025).
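A minimal sketch of sparse regression over a polynomial library is given below for a one-dimensional latent. It solves the $\ell_1$-regularized least-squares problem with plain ISTA (proximal gradient descent). The solver, the cubic library, the regularization weight, and the synthetic velocity data are all assumptions made for illustration and are not taken from the SINDy-FM paper.

```python
import numpy as np

def polynomial_library(z, degree=3):
    """Feature matrix with columns [1, z, z^2, ..., z^degree] for 1D latents."""
    return np.stack([z ** k for k in range(degree + 1)], axis=1)

def ista_lasso(theta, y, lam=1e-3, n_iters=5000):
    """Minimize (1 / 2n) * ||theta @ xi - y||^2 + lam * ||xi||_1 with ISTA."""
    n = theta.shape[0]
    gram = theta.T @ theta / n
    step = 1.0 / np.linalg.eigvalsh(gram).max()   # 1 / Lipschitz constant of the gradient
    xi = np.zeros(theta.shape[1])
    for _ in range(n_iters):
        grad = theta.T @ (theta @ xi - y) / n
        xi = xi - step * grad
        # Soft-thresholding: proximal operator of the l1 penalty
        xi = np.sign(xi) * np.maximum(np.abs(xi) - step * lam, 0.0)
    return xi

# Synthetic velocities from a known sparse law: dz/dt = -2 z + 0.5 z^3
z = np.linspace(-1.5, 1.5, 201)
velocity = -2.0 * z + 0.5 * z ** 3
xi = ista_lasso(polynomial_library(z), velocity)
print(np.round(xi, 3))
```

The fitted coefficient vector is the entire model: evaluating the learned dynamics requires only a handful of multiplications, which is where the small parameter count and low inference latency come from. The $\ell_1$ penalty introduces a small shrinkage bias on the nonzero coefficients.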

Comparison Table: Key Latent SB Algorithms

Algorithm	Expressivity	Sample Efficiency	Interpretability
Neural Diffusion	Arbitrary	Moderate	Black-box
Neural ODE	High (continuous)	High	Moderate
SINDy-FM	Limited (polynomial)	Very high	Explicit/Symbolic

4. Training Procedures and Architectures

Latent SB models are commonly trained in two stages:

Stage I: Pre-train the latent autoencoder (VAE or VQ-VAE) on a large dataset, using only complete data (e.g., complete 3D shapes), and freeze the encoder and decoder afterward. For depth-enhanced 3D tasks, multi-view rendering with DINOv2 features and cross-attention fusion are used in the encoder (Kong et al., 29 Jun 2025).
Stage II: Train the bridge model (neural diffusion, ODE, or symbolic) in the latent space. Endpoint pairs $(z_0, z_T)$ are sampled (for conditional tasks, partial data is encoded), and the model is optimized using either score-matching regression or direct flow-matching.

For practical efficiency, BridgeShape applies Gaussian bridge posteriors, enabling sampling in three steps. This is a significant reduction in inference time compared with standard DDPM pipelines, which require hundreds of steps (Kong et al., 29 Jun 2025).
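A few-step sampler built on Gaussian bridge posteriors can be sketched as follows. Starting from the partial-data latent at time $T$, each step predicts the clean latent and then draws the next state from the Brownian-bridge posterior between that prediction and the current state. The sketch assumes a Brownian reference with constant `sigma`. The `predict_z0` callable is a hypothetical interface standing in for the trained network, and the scheme is not claimed to match BridgeShape's exact sampler.

```python
import numpy as np

def bridge_posterior_step(z_t, z0_hat, t, s, sigma=1.0, rng=None):
    """Sample z_s ~ q(z_s | z_0 = z0_hat, z_t) for 0 <= s < t (Brownian reference)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - s / t) * z0_hat + (s / t) * z_t
    std = sigma * np.sqrt(s * (t - s) / t)
    return mean + std * rng.standard_normal(z_t.shape)

def sample(z_T, predict_z0, n_steps=3, T=1.0, sigma=1.0, seed=0):
    """Run n_steps reverse steps from time T down to time 0."""
    rng = np.random.default_rng(seed)
    times = np.linspace(T, 0.0, n_steps + 1)
    z = z_T
    for t, s in zip(times[:-1], times[1:]):
        z0_hat = predict_z0(z, t)          # network's estimate of the clean latent
        z = bridge_posterior_step(z, z0_hat, t, s, sigma, rng)
    return z

# Usage with an oracle predictor (illustration only): the sampler returns the clean latent
z_clean = np.array([[1.0, -2.0, 0.5]])
z_partial = z_clean + 0.8
z_out = sample(z_partial, predict_z0=lambda z, t: z_clean)
print(z_out)
```

The last step has $s = 0$, so the posterior variance vanishes and the output equals the final clean-latent prediction. The number of network evaluations therefore equals `n_steps`.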

5. Theoretical Guarantees and Convergence

A distinguishing feature of latent SB models is an end-to-end theoretical analysis of distributional approximation. The error between the generated and target data distributions, measured in Wasserstein-2 distance, decomposes into separately controlled terms. Crucially, the dominant convergence rate scales only with the dimension of the latent space rather than with the ambient data dimension. The resulting bound depends on the number of grid steps, the domain-shift error (data distribution mismatch), and the encoder–decoder error. This result demonstrates that latent SB models can avoid the curse of dimensionality inherent to data-space diffusion, provided the latent manifold is sufficiently compact (Jiao et al., 2024).
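For intuition about the error metric, the Wasserstein-2 distance between two equal-size one-dimensional samples can be computed exactly by sorting, because the optimal coupling in one dimension matches order statistics. The sketch below is only an empirical illustration of the metric on synthetic samples. It does not reproduce the theoretical bound, and the helper name is hypothetical.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """Wasserstein-2 distance between two equal-size 1D samples (sorted matching)."""
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x_sorted - y_sorted) ** 2))

rng = np.random.default_rng(0)
target = rng.standard_normal(1000)
generated = target + 0.7                  # a pure shift, so W2 equals the shift size
print(w2_empirical_1d(generated, target))
```

In higher dimensions no such sorting shortcut exists, and the distance is typically estimated with an optimal transport solver.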

6. Empirical Performance and Practical Impact

BridgeShape and related methods report state-of-the-art results in 3D shape completion and generative translation tasks. On 3D-EPN and PatchComplete, BridgeShape significantly outperforms prior methods in L1/TUDF grid error, Chamfer Distance, and volumetric IoU, on both seen and unseen categories. Increasing the resolution yields continued accuracy gains, while latent-bridge sampling keeps inference efficient (three reverse steps, 0.04 s total, compared with 100+ steps in DDPM-based baselines) (Kong et al., 29 Jun 2025).
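Two of the metrics named above can be sketched in a few lines. Conventions differ between benchmarks (squared versus unsquared distances, sums versus means, scaling factors), so the definitions below are one common variant chosen as an assumption and are not necessarily the exact protocol used for 3D-EPN or PatchComplete.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (n, 3) and q (m, 3).

    Assumed convention: mean squared nearest-neighbor distance, summed over both directions.
    """
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)   # (n, m) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def volumetric_iou(occ_a, occ_b):
    """Intersection over union of two occupancy grids of equal shape."""
    a, b = occ_a.astype(bool), occ_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0                        # both grids empty: treat as a perfect match
    return np.logical_and(a, b).sum() / union

# Two overlapping slabs in a 4x4x4 grid: 16 shared voxels out of 48 occupied
grid_a = np.zeros((4, 4, 4)); grid_a[:2] = 1
grid_b = np.zeros((4, 4, 4)); grid_b[1:3] = 1
points = np.random.default_rng(0).standard_normal((50, 3))
print(volumetric_iou(grid_a, grid_b), chamfer_distance(points, points))
```

Lower Chamfer distance and higher IoU indicate better completions. The dense pairwise distance matrix is fine for small point sets but would need a nearest-neighbor structure for large ones.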

For latent translation on MNIST, SINDy-FM achieves similar FID and Inception scores compared to neural ODE surrogates, with dramatic reductions in parameter count and computation time (100$\times$ faster inference, 300$\times$ fewer parameters), while producing visually coherent samples. Neural ODE surrogates offer improved flexibility for more complex latent transport (Khilchuk et al., 14 Dec 2025).

7. Recommendations and Future Directions

The choice of bridge solver should be matched to the geometry of the latent manifold. For nearly Gaussian latent spaces, or when interpretability and low latency are paramount, symbolic surrogates (SINDy-FM) are the preferred option. For highly nonlinear or complex latent structures, neural ODE surrogates maintain expressivity with competitive efficiency. Pretraining the bridge drift on the reference diffusion stabilizes learning in all scenarios. Hybrid (symbolic+neural) schemes can combine interpretability and expressive power. Open directions include the direct construction of latent encoders for arbitrary data modalities and sharper error analysis for non-Gaussian latent distributions (Khilchuk et al., 14 Dec 2025, Jiao et al., 2024).