
Stochastic Normalizing Flows

Updated 2 February 2026
  • Stochastic normalizing flows are hybrid models that combine deterministic invertible maps with stochastic Markov kernels to model complex, multimodal distributions.
  • They use a forward-reverse path-space framework with Radon–Nikodym derivatives and KL divergence for unbiased likelihood estimation and rigorous training.
  • Empirical results show SNFs outperform traditional normalizing flows and MCMC methods in high-dimensional, multimodal inference tasks.

Stochastic normalizing flows (SNFs) generalize the normalizing flow paradigm by interleaving invertible, deterministic transformations with stochastic transition layers—typically Markov kernels such as Metropolis–Hastings, Langevin, or diffusion steps. This hybrid architecture augments the expressiveness of flow-based models, enabling sampling from multimodal or topologically complex target distributions and facilitating efficient inference in high-dimensional, challenging probability landscapes. SNFs have been formally characterized both through coupled forward–reverse Markov chains and as non-equilibrium transformations admitting tractable training objectives via path-space divergences, Radon–Nikodym derivatives, and stochastic thermodynamics.

1. Mathematical Foundation and Formalism

The backbone of stochastic normalizing flows is a structured sequence alternating deterministic invertible maps $\{T_i\}$ and stochastic Markov kernels $\{K_i\}$ (Wu et al., 2020, Hagemann et al., 2021). For an initial latent $z_0 \sim p_0(z)$, the generative progression is

z_i' = T_i(z_{i-1}), \quad z_i \sim K_i(\cdot \mid z_i')

where the $T_i$ are dimensionality-preserving smooth diffeomorphisms (e.g., RealNVP-, MAF-, or Glow-style layers), and the $K_i$ are Markov kernels whose stationary distributions interpolate between the prior and the target density.
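Concretely, one generative pass alternates the two layer types. The NumPy sketch below is a minimal illustration, with a single affine map standing in for $T_i$ and an unadjusted Langevin step for $K_i$; the target density, step size, and layer shapes are illustrative assumptions, not a prescription from the literature:

```python
import numpy as np

rng = np.random.default_rng(0)

def affine_flow(z, scale, shift):
    # Deterministic invertible layer T_i; for a scalar scale,
    # log|det grad T_i| = dim(z) * log|scale|
    return z * scale + shift, z.size * np.log(np.abs(scale))

def langevin_kernel(z, grad_log_p, step=0.05):
    # Stochastic layer K_i: one overdamped Langevin step toward the target
    noise = rng.standard_normal(z.shape)
    return z + step * grad_log_p(z) + np.sqrt(2.0 * step) * noise

# Illustrative target N(3, 1), so grad log p(z) = -(z - 3)
grad_log_target = lambda z: -(z - 3.0)

z = rng.standard_normal(1000)                       # z_0 ~ p_0 = N(0, 1)
z, log_det = affine_flow(z, scale=1.5, shift=1.0)   # deterministic map T_1
z = langevin_kernel(z, grad_log_target)             # stochastic kernel K_1
z = langevin_kernel(z, grad_log_target)             # further stochastic mixing
```

The deterministic layer moves all probability mass rigidly and reports its log-Jacobian; the Langevin steps then relax the samples toward the target without needing an invertible form.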

This construction admits a forward chain $X_0 \to X_1 \to \dots \to X_T$ and a corresponding reverse chain, which together enable exact likelihood evaluation and training. The SNF path-space law for a trajectory $(z_0,\dots,z_T)$ is given by

q(z_0,\dots,z_T) = p_0(z_0) \prod_{i=1}^T |\det \nabla T_i(z_{i-1})| \, K_i(z_i \mid T_i(z_{i-1}))

with the marginal $q_T(z_T)$ forming the generative model’s output (Wu et al., 2020, Hagemann et al., 2021).

The corresponding Radon–Nikodym derivative formalism rigorously handles both deterministic and stochastic layers. For SNFs constructed from Markov chains, the local log–density updates are expressed as

\log p_t(z_t) = \log p_{t-1}(z_{t-1}) + \log K_t(z_t \mid z_{t-1}) - \log L_t(z_{t-1} \mid z_t)

where $L_t$ is the time-reversed kernel. Deterministic layers recover the standard log-Jacobian correction, while stochastic MCMC layers yield detailed-balance ratios (Hagemann et al., 2021).
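For a concrete instance of the forward/reverse kernel ratio, consider an unadjusted Langevin layer: both $K_t$ and the reverse kernel $L_t$ are then Gaussians with variance $2\varepsilon$ centered at one drift step, and the log-ratio reduces to a quadratic difference of means. A sketch under those assumptions (function names and the target are illustrative):

```python
import numpy as np

def log_gauss(x, mean, var):
    # Log-density of a univariate Gaussian
    return -0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))

def langevin_log_ratio(z_prev, z_next, grad_log_p, eps=0.05):
    # log K_t(z_t | z_{t-1}) - log L_t(z_{t-1} | z_t) for unadjusted Langevin:
    # both kernels are Gaussians with variance 2*eps, centered at one drift step
    mean_fwd = z_prev + eps * grad_log_p(z_prev)
    mean_rev = z_next + eps * grad_log_p(z_next)
    return (log_gauss(z_next, mean_fwd, 2.0 * eps)
            - log_gauss(z_prev, mean_rev, 2.0 * eps))

grad = lambda z: -(z - 1.0)   # grad log p for an illustrative N(1, 1) target
r = langevin_log_ratio(0.3, 0.8, grad)
```

By construction the ratio is antisymmetric in its endpoints: swapping $z_{t-1}$ and $z_t$ flips its sign, which is the discrete analogue of time reversal.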

2. Algorithmic Structure and Training Objectives

The SNF training objective generally takes the form of a path-space Kullback–Leibler divergence:

\mathcal{L}_{\rm SNF} = \mathbb{E}_{(z_0,\ldots,z_T)\sim \text{reverse chain}} \left[\log\frac{p_X(z_T)}{p_0(z_0)} + \sum_{t=1}^T \Delta_t(z_{t-1},z_t)\right]

where $\Delta_t$ denotes the local log-ratio of reverse and forward kernels for layer $t$ (Hagemann et al., 2021). Examples include:

  • Deterministic layers: $\Delta_t = -\log |\det\nabla T_t(z_{t-1})|$
  • Metropolis–Hastings or Langevin layers: $\Delta_t = \log p_t(z_{t-1}) - \log p_t(z_t)$
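These per-layer terms accumulate additively along a sampled trajectory. A toy sketch with one deterministic layer and one detailed-balance MCMC layer; the interpolating density and the numerical values are illustrative assumptions:

```python
import numpy as np

# Interpolating density p_t used by the stochastic layer (standard normal here)
log_p = lambda z: -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

def delta_deterministic(log_det_jac):
    # Delta_t for an invertible layer T_t: minus its log-Jacobian determinant
    return -log_det_jac

def delta_mcmc(z_prev, z_next):
    # Delta_t for a detailed-balance MCMC layer: log p_t(z_{t-1}) - log p_t(z_t)
    return log_p(z_prev) - log_p(z_next)

# Accumulate the per-layer terms along one sampled trajectory
path_sum = delta_deterministic(log_det_jac=0.7) + delta_mcmc(z_prev=1.0, z_next=0.2)
```

Summing these contributions over all layers yields exactly the inner bracket of the path-space KL objective above.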

For SNFs applied to Bayesian inverse problems, the kernels are conditioned on observed data $y$, and the reverse chain targets interpolations between the prior and posterior distributions along a geometric path (Hagemann et al., 2021).

End-to-end training typically employs stochastic gradient descent, sampling SNF trajectories, accumulating losses via the path-space KL, and backpropagating through deterministic transformations. Gradients through stochastic layers leverage the fact that accept–reject indicators have zero gradient almost everywhere (under mild regularity assumptions) (Hagemann et al., 2021).
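For the deterministic portion, the pathwise (reparameterization) gradient of the KL objective can be written by hand in one dimension. The following is a minimal sketch, not the papers' full procedure: it trains a single affine layer toward an assumed Gaussian target $N(3, 2^2)$; in a full SNF the stochastic layers would contribute their $\Delta_t$ terms to the same loss:

```python
import numpy as np

rng = np.random.default_rng(0)
s, m = 1.0, 0.0                  # parameters of one affine layer x = s*z + m

for step in range(2000):
    z = rng.standard_normal(256)             # batch of latents z_0 ~ N(0, 1)
    x = s * z + m                            # push through the flow layer
    # Pathwise gradients of KL(q || p) for target p = N(3, 2^2):
    # loss = E[-log s + (x - 3)^2 / 8] + const
    grad_m = np.mean((x - 3.0) / 4.0)
    grad_s = -1.0 / s + np.mean((x - 3.0) * z / 4.0)
    m -= 0.05 * grad_m                       # SGD updates
    s -= 0.05 * grad_s

# At convergence s ~ 2 and m ~ 3, matching the target's scale and mean
```

The $-1/s$ term is the gradient of the log-Jacobian correction; the remaining terms come from differentiating the target energy through the reparameterized sample, which is exactly what backpropagation does through the deterministic layers of an SNF.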

3. Architectural Variants and Expressivity

Deterministic normalizing flows are continuous bijections and thus cannot transform a unimodal latent density into a multimodal or disconnected target without resorting to highly unstable mappings. Inserting stochastic transitions between flow layers allows the model to probabilistically transition between modes and traverse low-density regions, overcoming expressivity limitations (Wu et al., 2020, Hagemann et al., 2021).

Layer types commonly include:

  • Coupling and autoregressive layers (e.g., RealNVP, Glow, MAF): efficient, exact computation of Jacobians.
  • Langevin dynamics: $x_t = x_{t-1} - a_1\nabla u_t(x_{t-1}) + a_2\xi_t$, with $\Delta_t$ derived from proposal reversibility.
  • Metropolis–Hastings kernels: accept/reject dynamics with symmetric proposals.
  • Diffusion-driven layers: Euler–Maruyama discretized SDE (Hagemann et al., 2021).
  • Variational autoencoder layers: integrated via their ELBO-style path-space loss (Hagemann et al., 2021).

Cross-architecture mixtures—e.g., interleaving deep coupling blocks with MCMC kernels—yield composite SNFs capable of efficiently sampling from complex, multimodal densities (Wu et al., 2020, Caselle et al., 2022).
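To see why the stochastic layers matter, note that a plain Metropolis–Hastings kernel can populate both modes of a bimodal density that no single well-conditioned bijection of a Gaussian reaches cleanly. A sketch with an assumed two-mode target (proposal scale and step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_kernel(z, log_p, prop_std=0.5):
    # One Metropolis-Hastings step with a symmetric Gaussian proposal.
    # Leaves log_p invariant; the accept/reject indicator is not
    # differentiated (its gradient is zero almost everywhere).
    prop = z + prop_std * rng.standard_normal(z.shape)
    log_alpha = log_p(prop) - log_p(z)
    accept = np.log(rng.random(z.shape)) < log_alpha
    return np.where(accept, prop, z)

# Unnormalized two-mode target: equal Gaussians at -3 and +3
log_p = lambda z: np.logaddexp(-0.5 * (z - 3.0) ** 2, -0.5 * (z + 3.0) ** 2)

z = rng.standard_normal(5000)    # start from the unimodal latent density
for _ in range(200):
    z = mh_kernel(z, log_p)      # samples migrate into both modes
```

After mixing, roughly half the samples sit in each mode, something a deterministic flow could only achieve by tearing the latent density apart with a near-singular map.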

4. Theoretical Guarantees and Thermodynamic Connections

The SNF formalism guarantees that, under appropriate ergodicity and detailed-balance conditions on the stochastic kernels, the model samples from the target distribution, with explicit unbiased likelihood computation via path-space Radon–Nikodym derivatives (Hagemann et al., 2021).

SNFs are tied to non-equilibrium statistical mechanics through Jarzynski’s equality and the Crooks fluctuation theorem (Caselle et al., 2022, Caselle et al., 2024). Each trajectory accumulates a “work” term consisting of deterministic flow contributions (log-Jacobian “heat”) and stochastic transitions (“Monte Carlo heat”), yielding unbiased estimators for free-energy and partition-function ratios:

\frac{Z_{\rm target}}{Z_{\rm prior}} = \langle \exp(-w) \rangle_{\rm paths}

where $w$ aggregates the action differences, Jacobian corrections, and MC heat along SNF paths.
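As a sanity check of the Jarzynski identity, consider the degenerate case of an SNF with identity transport: the accumulated work reduces to an energy difference, and $\langle\exp(-w)\rangle$ recovers a partition-function ratio that is known in closed form. A toy sketch with assumed Gaussian energies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized energies: prior N(0, 1)    -> u0(x) = x^2 / 2
#                        target N(0, 0.5^2) -> u1(x) = x^2 / (2 * 0.25) = 2 x^2
u_prior = lambda x: 0.5 * x ** 2
u_target = lambda x: 2.0 * x ** 2

# Identity-transport SNF: the work per path is just the energy difference
x = rng.standard_normal(200_000)           # samples from the prior
w = u_target(x) - u_prior(x)               # accumulated "work"
z_ratio = np.mean(np.exp(-w))              # Jarzynski: Z_target / Z_prior

# The exact ratio here is sigma_target / sigma_prior = 0.5
```

With nontrivial flow and MCMC layers, $w$ additionally collects the log-Jacobian and Monte Carlo heat terms along each path, but the estimator keeps the same form.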

5. Empirical Evidence and Applications

The empirical advantages of SNFs have been demonstrated on multimodal inference, generative modeling, molecular equilibrium sampling, time-series density estimation, and lattice field theory (Wu et al., 2020, Hagemann et al., 2021, Caselle et al., 2022, Caselle et al., 2024). Key benchmarks include:

  • High effective sample size (ESS) across modes in multimodal Gaussian mixtures: SNFs achieve $\approx 80\%$ ESS, compared to $10\%$ for a pure NF and $0.1\%$ for Langevin dynamics (Wu et al., 2020).
  • Low overlap error ($<5\%$) and rapid mixing ($\sim 10^2$ steps) on 2D multimodal benchmarks.
  • Robust inference for inverse problems (scatterometry, nonlinear regression), outperforming conditional invertible networks (INNs) on Wasserstein and KL metrics (Hagemann et al., 2021).
  • Superior partition-function estimation and observable calculation in lattice field theory, with SNFs maintaining near-ideal ESS while requiring orders-of-magnitude fewer MC updates than standard methods (Caselle et al., 2022, Caselle et al., 2024).

6. Extensions, Limitations, and Future Directions

Recent developments include pseudo-reversible normalizing flows for conditional sampling of SDE final states given arbitrary initial distributions, exploiting paired (non-exactly invertible) networks with soft reversibility penalties and proven KL convergence (Yang et al., 2023).

Algorithmic limitations of SNFs center on gradient estimation variance in highly stochastic chains, potential bias if stochastic layers lack reparameterizability, and protocol scheduling for optimal mixing in correlated target domains (Caselle et al., 2022).

Future directions include continuous-time SNFs using neural SDEs (Hodgkinson et al., 2020, Zhang et al., 2021), strict integration with score-based models, extensions to gauge field sampling, and adaptive scheduling guided by real-time work variance (Caselle et al., 2024, Caselle et al., 2022).

7. Summary Table: SNF Components

Layer Type | Transition Mechanism | Density Correction/Ratio
Deterministic NF | $x' = T(x)$ | $-\log|\det \nabla T(x)|$
Langevin kernel | $x' = x - a_1\nabla u(x) + a_2\xi$ | Quadratic difference of means (see text)
Metropolis–Hastings | Accept/reject with proposal $q$ | $\log p_t(x) - \log p_t(x')$
Diffusion flow | Euler–Maruyama update of an SDE | Quadratic in forward/reverse means

Deterministic invertible layers provide tractable log-determinant corrections, while stochastic kernels grant the ability to traverse topological barriers between modes, with path-space ratios ensuring unbiased endpoint statistics.


Stochastic normalizing flows unify deterministic invertible maps and stochastic mixing steps into a mathematically rigorous, trainable, and highly expressive generative modeling framework. Their applications span density estimation, inference under uncertainty, physical simulation, and statistical mechanics, offering robust guarantees and practical sampling agility.
