Continuous Normalizing Flows (CNFs)
- Continuous Normalizing Flows (CNFs) are generative models that use neural ODEs to continuously transform base distributions, enabling exact likelihood evaluations.
- They compute the log-density by integrating the instantaneous divergence of the velocity field along sample paths, replacing the per-layer log-determinant Jacobian computation of discrete flows.
- CNFs incorporate optimal transport theory and geometric regularization, bridging advanced statistical estimation with scalable deep learning frameworks.
Continuous normalizing flows (CNFs) are a class of generative models that realize invertible transformations by solving an ordinary differential equation (ODE) parameterized by a neural network, thereby providing an expressive framework for density estimation, generative modeling, and modern probabilistic inference. CNFs generalize the traditional concept of normalizing flows from discrete compositions to the continuum, offering explicit computation of the log-determinant of the Jacobian via integration of an instantaneous trace (divergence) along sample paths. These models have deep connections to neural ODEs, optimal transport theory, and geometric statistics.
1. Core Definition and Theoretical Foundations
A continuous normalizing flow is characterized by a time-dependent vector field $f_\theta(z, t)$, commonly instantiated as a neural network, which defines the ODE
$$\frac{dz(t)}{dt} = f_\theta(z(t), t), \qquad z(0) = z_0 \sim p_0, \quad t \in [0, 1],$$
where $p_0$ is a tractable base distribution (often standard normal), with $p_1$ targeting the data distribution. The fundamental contribution of CNFs is the continuous change-of-variables formula, which captures the infinitesimal evolution of density along the flow:
$$\frac{\partial \log p_t(z(t))}{\partial t} = -\operatorname{tr}\!\left(\frac{\partial f_\theta}{\partial z}(z(t), t)\right) = -\nabla \cdot f_\theta(z(t), t).$$
Integrating in $t$ yields the log-density at $t = 1$:
$$\log p_1(z(1)) = \log p_0(z(0)) - \int_0^1 \operatorname{tr}\!\left(\frac{\partial f_\theta}{\partial z}(z(t), t)\right) dt.$$
This enables exact likelihood evaluation under invertibility conditions and forms the basis of maximum-likelihood or flow-matching training of CNFs. The continuous-time viewpoint seamlessly extends to manifold-valued domains; on a Riemannian manifold $\mathcal{M}$, the divergence operator generalizes using the underlying geometric structure (Mathieu et al., 2020, Falorsi, 2021, Ben-Hamu et al., 2022).
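To make the change-of-variables computation concrete, the following is a minimal sketch (not the implementation of any cited work): it integrates the ODE with fixed-step forward Euler and accumulates the exact divergence via autograd, which is practical only in low dimensions. The callable `f`, the step count, and the toy contracting field are illustrative assumptions.

```python
import math
import torch

def exact_divergence(f, z, t):
    """Exact divergence (trace of the Jacobian) of f(., t) at z, one autograd
    pass per dimension; practical only for small d."""
    z = z.detach().requires_grad_(True)
    with torch.enable_grad():
        out = f(z, t)
        div = torch.zeros(z.shape[0], device=z.device)
        for i in range(z.shape[1]):
            div = div + torch.autograd.grad(out[:, i].sum(), z, retain_graph=True)[0][:, i]
    return div.detach()

def cnf_log_density(f, z0, log_p0, n_steps=200):
    """Forward-Euler integration of dz/dt = f(z, t) and d(log p)/dt = -div f on [0, 1],
    returning z(1) and log p_1(z(1)). Evaluation only; training would keep the
    computation graph or use the adjoint method."""
    z, logp, dt = z0, log_p0, 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((z.shape[0], 1), k * dt)
        logp = logp - exact_divergence(f, z, t) * dt   # continuous change of variables
        z = z + f(z, t) * dt                           # Euler step of the ODE
    return z, logp

# usage with a toy contracting field (illustrative only)
f = lambda z, t: -z
z0 = torch.randn(512, 2)
log_p0 = -0.5 * (z0 ** 2).sum(dim=1) - 0.5 * z0.shape[1] * math.log(2 * math.pi)
z1, log_p1 = cnf_log_density(f, z0, log_p0)
```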
2. Connections to Optimal Transport and Wasserstein Flows
Recent advances have anchored CNF regularization and training in the geometry of optimal transport and Wasserstein gradient flows. In this setting, the Benamou–Brenier formulation interprets the optimal transport map between two distributions as the solution of a dynamic minimization of kinetic energy:
$$W_2^2(p_0, p_1) = \min_{(p_t, v_t)} \int_0^1 \!\!\int \|v_t(z)\|^2\, p_t(z)\, dz\, dt \quad \text{subject to} \quad \partial_t p_t + \nabla \cdot (p_t v_t) = 0,$$
with prescribed endpoint marginals $p_0$ and $p_1$. Under this regime, CNFs can be regularized by enforcing the velocity field to follow the Wasserstein gradient flow of a chosen functional $\mathcal{F}[p]$, for example by imposing
$$f_\theta(z, t) = -\nabla_z \frac{\delta \mathcal{F}}{\delta p_t}(z),$$
and directly incorporating a velocity-alignment regularizer into training objectives:
$$\mathcal{R}(\theta) = \mathbb{E}_{t,\, z \sim p_t}\left\| f_\theta(z, t) + \nabla_z \frac{\delta \mathcal{F}}{\delta p_t}(z) \right\|^2$$
(Hou, 2023, Onken et al., 2020, Vidal et al., 2022). This coupling embeds CNFs within the solution space of Fokker–Planck equations and guarantees well-behaved transport, minimal trajectory complexity, and an interface with the Jordan–Kinderlehrer–Otto (JKO) proximal scheme for iteratively minimizing the Kullback–Leibler divergence in Wasserstein space.
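As an illustration of the velocity-alignment idea, the sketch below computes a Monte Carlo estimate of the regularizer for the simplest case of a linear (potential-energy) functional $\mathcal{F}[p] = \int V\, dp$, whose first variation is just $V$, so that $\nabla_z\, \delta\mathcal{F}/\delta p = \nabla V$. The MLP field, the quadratic potential, and the assumption that `z` approximates samples from $p_t$ are illustrative, not taken from the cited works.

```python
import torch
import torch.nn as nn

def velocity_alignment_penalty(field, grad_V, z, t):
    """Monte Carlo estimate of E_{t, z ~ p_t} || f_theta(z, t) + grad_z (dF/dp)(z) ||^2
    for the potential functional F[p] = \int V dp, where grad_z (dF/dp) = grad V."""
    residual = field(z, t) + grad_V(z)
    return residual.pow(2).sum(dim=1).mean()

# illustrative setup: a small MLP velocity field and the quadratic potential V(z) = ||z||^2 / 2
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))
field = lambda z, t: net(torch.cat([z, t], dim=1))
grad_V = lambda z: z                       # gradient of ||z||^2 / 2

z = torch.randn(256, 2)                    # stand-in for samples from p_t along the flow
t = torch.rand(256, 1)
penalty = velocity_alignment_penalty(field, grad_V, z, t)
penalty.backward()                         # gradients flow into the field's parameters
```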
3. Regularization, Geometric Structure, and Robustness
Embedding CNFs in geometric frameworks stabilizes training, reduces the risk of degenerate or stiff flows, and enhances robustness, which is particularly relevant in finite-sample regimes and in causal inference. For instance, in the context of causal effect estimation, regularizing the path of the flow so that its instantaneous score function aligns with the efficient influence function $\varphi_{p_t}$ for a target functional (e.g., a counterfactual mean) at $p_t$ minimizes the variance of statistical estimation (Hou, 2023). When the parameters are chosen so that $f_\theta$ represents the negative Wasserstein gradient of the variance functional (as arises, e.g., for mean estimation), CNFs directly attain the Cramér–Rao lower bound at each $t$, outperforming traditional semiparametric estimators such as TMLE and AIPW.
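A hypothetical sketch of such an influence-function regularizer is given below for the simple mean functional $\psi(p) = \mathbb{E}_p[Z]$, whose nonparametric efficient influence function is $\varphi_p(z) = z - \psi(p)$. Here the EIF is estimated by a minibatch plug-in and the velocity field is pushed toward its negative, mirroring the velocity-alignment form above; this is one possible instantiation, and the exact objective in (Hou, 2023) may differ.

```python
import torch

def eif_alignment_penalty(field, z, t):
    """Penalize deviation of the velocity field from the negative plug-in EIF of the
    mean functional psi(p) = E_p[Z], i.e. phi(z) = z - mean(z); the samples z are
    assumed to approximate p_t along the flow."""
    phi = z - z.mean(dim=0, keepdim=True)   # plug-in efficient influence function
    residual = field(z, t) + phi
    return residual.pow(2).sum(dim=1).mean()
```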
4. Implementation: Training, Inference, and Algorithmic Structure
CNFs are commonly implemented with the neural ODE paradigm, leveraging either the adjoint sensitivity method for memory-efficient gradients or a discretize-then-optimize (Disc-Opt) approach for improved gradient accuracy and computational predictability (Onken et al., 2020). The log-density is computed by augmenting the state with a scalar $\ell(t) = \log p_t(z(t))$ satisfying
$$\frac{d\ell(t)}{dt} = -\operatorname{tr}\!\left(\frac{\partial f_\theta}{\partial z}(z(t), t)\right)$$
and numerically integrating it simultaneously with $z(t)$. Trace estimation for the divergence is performed either exactly (when the parameterization admits a closed form, as in certain OT-inspired CNFs) or via Hutchinson's stochastic estimator in high-dimensional settings.
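The sketch below shows the standard Hutchinson estimator of the divergence, based on the identity $\operatorname{tr}(J) = \mathbb{E}_\varepsilon[\varepsilon^\top J \varepsilon]$, computed with one vector-Jacobian product per probe so the cost does not grow with the data dimension; the callable `f` and the probe count are illustrative assumptions.

```python
import torch

def hutchinson_divergence(f, z, t, n_probes=1):
    """Unbiased stochastic estimate of div f = tr(df/dz) via Hutchinson's identity,
    using Gaussian probe vectors (Rademacher probes are a common lower-variance choice)."""
    z = z.detach().requires_grad_(True)
    with torch.enable_grad():
        out = f(z, t)
        est = torch.zeros(z.shape[0], device=z.device)
        for _ in range(n_probes):
            eps = torch.randn_like(z)
            vjp = torch.autograd.grad(out, z, grad_outputs=eps, retain_graph=True)[0]  # eps^T J
            est = est + (vjp * eps).sum(dim=1)                                         # eps^T J eps
    return est / n_probes
```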
Algorithmically, advanced CNFs introduce proximal updates via the JKO scheme (Vidal et al., 2022) or directly regularize against OT-theoretic or influence-function objectives (Hou, 2023). The following high-level structure is prototypical for geometry-aware CNF training:
```python
for minibatch in data:
    # a) Sample latent base z0 ~ p0
    # b) Integrate z(t) = ODESolve(z0, f_theta, t=[0, 1])
    # c) Compute log p1(z(1)) from the base density and the accumulated divergence
    # d) Primary loss: negative log-likelihood of the data
    # e) Velocity-alignment loss: E_{t, z ~ p_t} ||f_theta(z, t) + nabla_z (delta F / delta p)(z)||^2
    # f) Total loss: sum of primary and regularization terms
    # g) Backpropagate through the ODE (adjoint or Disc-Opt) and update parameters
```
5. Empirical Performance and Statistical Guarantees
Empirical studies demonstrate that geometry-aware and OT-regularized CNFs yield strictly improved mean-squared error in plug-in estimators for statistical functionals, especially under finite-sample regimes where bias-variance trade-offs are prominent (Hou, 2023). In causal inference toy experiments (e.g., 8-Gaussians, Pinwheel), such CNFs reduce estimator variance and total MSE along the path (interpolating from $p_0$ to $p_1$) compared to naïve flows. The theoretical machinery guarantees that efficient influence functions are attained along the path, achieving the semiparametric Cramér–Rao lower bound at each $t$.
The framework thus provides a robust bridge between flexible generative modeling and rigorous statistical estimation—crucial for scientific applications, simulation-based inference, and scenarios where parameter estimation bias can arise from sample/population mismatches.
6. Broader Implications and Future Directions
The integration of geometric and optimal transport perspectives into CNF design enables models that are robust, statistically efficient, and better aligned with theoretical desiderata in density estimation, inference, and causal analysis. Wasserstein-regularized CNFs and proximal JKO flows provide divide-and-conquer strategies for stabilizing convergence and mitigating sensitivity to hyperparameters such as penalty weights.
Open questions include the generalization of velocity-alignment regularization to complex target functionals, principled selection of transport cost functionals in empirical settings, extension to manifold-valued data and structured outcomes, and the scaling of exact-trace or closed-form divergences to architectures beyond fully connected and low-rank models.
Recent research also explores the combination of CNFs with adaptive MCMC, multi-resolution data decompositions, and manifold-specific ODE integration, continuing to expand the expressivity and theoretical foundation of this modeling paradigm (Gerdes et al., 2024, Voleti et al., 2021, Ben-Hamu et al., 2022).