Normalizing Flows (NFs)
- Normalizing flows are likelihood-based generative models that convert simple distributions into complex data densities using sequences of invertible, differentiable transformations.
- They leverage architectures such as coupling layers, autoregressive flows, and spline flows to ensure tractable Jacobian determinants and efficient maximum likelihood estimation.
- Applications range from image and audio generation to Bayesian inference and reinforcement learning, with ongoing research addressing stability, expressivity, and topological challenges.
Normalizing Flows (NFs) are a class of likelihood-based generative models that construct complex probability densities by applying a sequence of invertible, differentiable transformations to a simple base distribution, commonly a standard Gaussian or uniform. Through the change-of-variables formula, NFs enable efficient and exact evaluation of likelihoods and support tractable sampling, positioning them as a unifying tool in unsupervised learning, probabilistic modeling, Bayesian inference, and other areas demanding explicit probabilistic reasoning.
1. Mathematical and Theoretical Foundations
A normalizing flow is defined as a sequence of diffeomorphisms $f = f_K \circ \cdots \circ f_1$ transforming a base random variable $z_0 \sim p_0$ to a data variable $x = f(z_0)$. Using the change-of-variables formula, the transformed density is evaluated as

$$p_X(x) = p_0(z_0)\,\left|\det \frac{\partial f}{\partial z_0}\right|^{-1},$$

where $\partial f / \partial z_0$ is the Jacobian of $f$ at $z_0$. In practice, the transformation is performed in multiple steps (the "flow"), with each step's Jacobian structured to facilitate efficient determinant computation. The log-likelihood decomposes as

$$\log p_X(x) = \log p_0(z_0) - \sum_{k=1}^{K} \log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|,$$

with $z_k = f_k(z_{k-1})$ and $z_K = x$.
This framework is grounded in principles of invertible mappings (diffeomorphisms) and connects classical probability transformations with modern deep learning (1908.09257, 1912.02762). Composition of elementary flows supports universal density approximation under mild conditions (1912.02762).
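To make the decomposition concrete, here is a minimal NumPy sketch of a two-step flow built from elementwise affine maps: the forward pass accumulates per-step log-determinants, and the exact log-likelihood is evaluated by inverting the flow. All names and parameter values are illustrative, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3  # data dimensionality

# Two invertible elementwise affine steps f_k(z) = exp(s_k) * z + b_k.
s1, b1 = rng.normal(size=D), rng.normal(size=D)
s2, b2 = rng.normal(size=D), rng.normal(size=D)

def forward(z0):
    """Push a base sample through the flow, accumulating log|det J_k|."""
    log_det = 0.0
    z1 = np.exp(s1) * z0 + b1   # step 1: diagonal Jacobian diag(exp(s1))
    log_det += np.sum(s1)       # log|det| of a diagonal Jacobian
    x = np.exp(s2) * z1 + b2    # step 2
    log_det += np.sum(s2)
    return x, log_det

def log_prob(x):
    """Exact log-likelihood via the inverse flow and change of variables."""
    z1 = (x - b2) * np.exp(-s2)
    z0 = (z1 - b1) * np.exp(-s1)
    log_base = -0.5 * np.sum(z0**2) - 0.5 * D * np.log(2 * np.pi)  # N(0, I) base
    return log_base - (np.sum(s1) + np.sum(s2))  # log p0(z0) - sum_k log|det J_k|

z0 = rng.normal(size=D)
x, _ = forward(z0)
print(log_prob(x))
```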
2. Flow Architecture and Model Variants
A critical design consideration is the structure of the invertible transformations:
- Affine/Linear Flows: Simple invertible affine transformations ($f(z) = Az + b$ with invertible $A$) with tractable determinants.
- Planar/Radial Flows: Introduce non-linear "bends" in the distribution, allowing for non-affine warping; e.g., the planar flow $f(z) = z + u\,h(w^\top z + b)$.
- Coupling Layers: Partition the input into subsets, transforming one subset conditioned on the others; the Jacobian is triangular, allowing efficient determinant calculation (1908.09257).
- Autoregressive Flows: Masked structures (MAF/IAF) model high-dimensional data by sequentially transforming coordinates conditioned on previous ones, facilitating triangular Jacobians (1908.09257).
- Spline Flows: Use piecewise invertible functions (e.g., rational quadratic splines), increasing nonlinearity and expressivity (2202.09188).
- Continuous Flows: Ordinary Differential Equation–based flows (Neural ODEs/FFJORD) provide infinite-depth transformation by integrating velocity fields (1908.09257, 1912.02762).
- Mixtures of Flows: Combine multiple flows in a mixture model, allowing components to specialize for complex structures (e.g., 3D point clouds) (2106.03135).
- Conditional Flows: Model conditional densities where all transformation parameters or even the base density are functions of the conditioning variable (1912.00042).
- Gradient Boosted Flows: Sequentially add normalizing flow components (boosting) to correct inadequacies of previous components, supporting modular training and improved density estimation (2002.11896).
- Parameter-Efficient Flows: Share parameters across flow steps using factorization and indicator embeddings, reducing the parameter complexity from linear to sublinear in the number of layers (2006.06280).
Each architecture is tailored to balance expressivity, tractability, and computational efficiency; triangular or autoregressive structures favor efficient log-determinant calculation but may restrict parallelization or require stacking for expressiveness.
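As a concrete instance of the triangular-Jacobian pattern, the sketch below implements a RealNVP-style affine coupling layer in PyTorch. The conditioner architecture, the tanh-bounded scale, and the dimensions are illustrative choices under the assumptions stated in the comments, not a reference implementation.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling layer: transforms one half of the input conditioned on the other."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Conditioner: maps the untouched half to scale and shift for the other half.
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                      # bound the log-scale for stability
        y2 = x2 * torch.exp(s) + t             # transform the second half only
        log_det = s.sum(dim=-1)                # triangular Jacobian: sum of log-scales
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=-1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)

layer = AffineCoupling(dim=4)
x = torch.randn(8, 4)
y, log_det = layer(x)
print(torch.allclose(x, layer.inverse(y), atol=1e-5))  # invertibility check
```

Because only half of the coordinates are transformed per layer, coupling layers are typically stacked with permutations between them to mix dimensions.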
3. Expressiveness, Limitations, and Recent Extensions
While NFs are theoretically universal for continuous densities on Euclidean domains, the invertibility and dimension-preserving requirements impose practical limitations:
- Topological Constraints: Bijective transformations require that the supports of the base and target densities be topologically equivalent. This restricts the modeling of data on lower-dimensional manifolds or with nontrivial topology (e.g., disconnected components, holes) (2309.04433). Approximating such targets may force the transformation's bi-Lipschitz constant to diverge, resulting in poor flexibility near boundaries.
- High Dimensionality: As the ambient dimension grows, performance of simple affine or coupling-based flows degrades unless the architecture is sufficiently expressive (e.g., with deep spline flows) (2202.09188). Recent work addresses this by employing more expressive nonlinearities and transformer-based architectures (2412.06329).
- Manifold Data: Extensions such as the inflation–deflation approach inflate a manifold by injecting noise in normal directions, train the NF on the resulting full-dimensional data, and analytically deflate the learned density to recover the manifold density (2105.12152). ManiFlow combines density estimation with geometry-driven manifold projection using likelihood gradients for sample refinement (2208.08932).
Recent research proposes relaxations and hybridizations to address these issues:
- Variational and Surjective Extensions: Introduce stochastic or surjective layers, inspired by VAEs and diffusion models, allowing flows to model topological or dimensionality mismatches (2309.04433). SurVAE flows and score-based flows are examples.
- Stochastic and Score-Based Flows: Combine deterministic invertible mappings with noise-injection or Langevin steps (e.g., Stochastic Normalizing Flows), broadening types of expressible distributions (2309.04433).
4. Learning, Stability, and Regularization
Training NFs is commonly done via maximum likelihood estimation. However, depth and overparametrization can destabilize training, especially in high dimensions or when the data have intrinsic low dimension relative to the ambient space:
- Instability: Exploding or vanishing Jacobian determinants can impede training (2402.16408). Exponential scaling of sample magnitudes across layers can introduce high gradient variance, especially for deep flows approximating high-dimensional posteriors.
- Regularization: Techniques such as Tikhonov regularization (an L2 penalty on the Jacobian) stabilize learning and component extraction by preventing divergence along low-variance directions (1907.06496). In RealNVP-based flows, soft-thresholding of scale outputs and bijective soft-log sample transformations ("LOFT") help bound intermediate values, control gradient variance, and improve marginal likelihood estimation (2402.16408); see the sketch after this list.
- Boosted and Modular Training: Gradient boosted flows train small flow components sequentially, mitigating the optimization challenges of deep monolithic flows and enabling on-demand complexity (2002.11896).
- JKO Schemes: In continuous normalizing flows (CNFs), hyperparameter-sensitive kinetic energy penalties from OT-based formulations are replaced by the JKO scheme, which divides optimization into simpler proximal subproblems for improved efficiency and stability (2211.16757).
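Below is a hedged sketch of two of the stabilization ideas above, assuming a coupling-layer parameterization: a smooth clamp on the log-scale outputs (so no single step can explode or collapse sample magnitudes) and a simple L2 penalty on the per-sample log-determinant as a rough Tikhonov-style proxy. The clamp limit and penalty weight are illustrative, not values from the cited papers.

```python
import torch

def soft_clamp(log_scale, limit=2.0):
    # Smoothly bound log-scales to (-limit, limit) before exponentiating,
    # keeping per-step Jacobian factors away from 0 and infinity.
    return limit * torch.tanh(log_scale / limit)

def regularized_nll(log_prob_x, log_det, weight=1e-3):
    # Maximum-likelihood objective plus an L2 penalty on the log-determinant,
    # a crude proxy for discouraging divergence along low-variance directions.
    return -log_prob_x.mean() + weight * (log_det ** 2).mean()
```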
5. Practical Applications
NFs have been applied and validated in diverse domains:
- Density Estimation and Generation: NFs achieve state-of-the-art likelihoods and visually compelling samples in image, video, audio, and graph generation. Recent Transformer-based autoregressive flows (TarFlow) rival or surpass diffusion models in sample fidelity and likelihood estimation (2412.06329).
- Structured Prediction and Inverse Problems: Conditional NFs directly model multivariate targets in tasks such as super-resolution, vessel segmentation, and inverse imaging, yielding sharper results and better representing uncertainty than pixelwise or factored baselines (1912.00042).
- Bayesian Inference: NFs serve as expressive variational families for approximating complex posteriors, including those in high-dimensional parameter spaces; when integrated into optimal experimental design, they provide unbiased, highly expressive variational approximations that tighten information gain bounds and capture multimodality (2404.13056). A minimal ELBO sketch follows this list.
- Physics and Scientific Computing: Applications include lattice gauge theory, sampling in effective string theory (see PI-SNFs for the Nambu-Goto action (2309.14983)), equilibrium state generation, and gravitational wave inference. NFs offer significant improvements over traditional Monte Carlo methods by globally parameterizing complex distributions.
- Reinforcement Learning: Modern RL algorithms increasingly employ NFs for policies, Q-functions, and occupancy densities, benefiting from direct likelihood optimization and scalable sampling as an alternative to diffusion or transformer models (2505.23527). NFs streamline algorithmic design and can outperform or match more complex generative models in imitation learning, offline RL, and unsupervised settings.
- Point Cloud Modeling: Mixtures of flows specialize in subregions of 3D shapes, enabling parameter-efficient and more detailed generation and inference for point cloud data (2106.03135).
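The sketch below illustrates the variational-inference use referenced in the Bayesian inference item above: because a flow provides exact log q(z) for its own samples, a Monte Carlo ELBO estimate follows directly from the change of variables. Here `flow_forward` and `log_joint` are assumed user-supplied callables (the flow's forward pass returning samples and summed log-determinants, and the model's joint log-density); both names are hypothetical.

```python
import math
import torch

def elbo(flow_forward, log_joint, base_dim, n_samples=64):
    # Reparameterized sampling: draw base samples and push them through the flow.
    eps = torch.randn(n_samples, base_dim)
    z, log_det = flow_forward(eps)          # z = f(eps), with sum_k log|det J_k|
    # Exact log q(z) via the change of variables from the standard normal base.
    log_q = (-0.5 * (eps ** 2).sum(-1)
             - 0.5 * base_dim * math.log(2 * math.pi)
             - log_det)
    # Monte Carlo estimate of E_q[log p(x, z) - log q(z)].
    return (log_joint(z) - log_q).mean()
```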
6. Recent Advances and Future Directions
Recent work drives the field toward greater expressivity, efficiency, and application breadth:
- Autoregressive and Transformer-Based Flows: Employing transformers in place of masked MLPs (e.g., TarFlow) enhances context modeling for images and supports guidance and denoising techniques that close the gap between likelihood-based models and diffusion models in visual sample quality (2412.06329).
- Parameter Efficiency: Sharing parameters across flow steps (as in NanoFlow), weighted channel shuffling, and entropy-guided multi-scale modeling continue to reduce the memory and computational costs of scaling NFs to large inputs (2006.06280, 2407.04958).
- Manifold and Topology-Adapted Flows: Methods such as inflation–deflation, manifold projection via score gradients, and integrating non-bijective/stochastic layers expand the applicability of NFs to data with low-dimensional or complex support structures (2105.12152, 2208.08932, 2309.04433).
- Optimal Transport Alignment: Post-processing of flows into Monge maps via geodesic, measure-preserving transformations allows density-preserving minimization of transport cost, yielding mappings with desirable geometric properties while maintaining likelihoods (2209.10873).
Open problems include further improving training stability and sample efficiency in very high dimensions, extending to discrete and manifold-valued data, balancing trade-offs between expressivity and computational tractability, and developing new architectural motifs (including non-coupling, self-attentive, or stochastic layers) that retain the advantages of invertibility (2309.04433, 2402.16408).
7. Summary Table: Core Properties of Major NF Architectures
| Architecture | Expressivity | Determinant Computation | Sampling Speed | Example Uses |
|---|---|---|---|---|
| Coupling Flow | Moderate (stacked) | Fast (triangular) | Very fast | Image/audio generation |
| Autoregressive | High (permutation equiv.) | Fast (triangular) | Slow in one direction | Density estimation |
| Spline-based | Very high | Fast (piecewise) | Moderate | High-dim models, HEP |
| Mixtures | Very high (compositional) | N/A (per component) | Efficient per mixture | Point clouds, structured data |
| Continuous (CNF) | High (ODE-based) | Trace along path | Fast, supports VI | Bayesian inference, physics |
| Transformer/MAF | Very high | Fast (autoregressive) | Sequential but scalable | High-res images, RL |
A plausible implication is that, as more expressive and efficient NF architectures mature, their adoption across scientific, industrial, and data-driven domains will broaden, especially in applications where exact likelihoods, efficient sampling, and interpretable probabilistic modeling are central requirements.