Normalized Flow Matching (NFM)

Updated 13 March 2026

Normalized Flow Matching (NFM) is a technique that trains neural-ODE generative models using a structured coupling from a pretrained normalizing flow.
It employs a coupling-distillation paradigm to achieve faster training, reduced sampling curvature, and enhanced sample quality compared to independent or OT couplings.
Experimental results on ImageNet datasets demonstrate that NFM offers significant inference speedups and lower FID scores, outperforming the AR-NF teacher model.

Normalized Flow Matching (NFM) is a method for training neural ordinary differential equation (neural-ODE) generative models through a coupling-distillation paradigm. It reframes the standard flow matching (FM) learning process by using the bijective mapping of a pretrained normalizing flow (NF) as a structured coupling between endpoint pairs. Compared with independent or optimal-transport (OT) couplings, this yields faster training, reduced sampling curvature, and improved sample quality, and inference is substantially faster than with the autoregressive NF (AR-NF) teacher model (Berthelot et al., 9 Mar 2026).

1. Flow Matching and the Role of the Coupling Measure

Flow Matching (FM) generative modeling learns a time-dependent velocity field $v_\theta(x, t)$ to transform points from a source distribution (typically Gaussian noise, $p_{\rm noise}(z)$ ) to a target distribution (data, $p_{\rm data}(x)$ ). The core mechanism is a regression loss defined over data/noise pairings and interpolated positions: $x_t = (1-t)x + t z, \quad t \in [0,1].$ A coupling measure $q(x, z)$ with $q(x) = p_{\rm data}(x)$ and $q(z) = p_{\rm noise}(z)$ determines how $x$ and $z$ are paired. The FM objective is given by: $\mathcal{L}_{\nabla}(\theta) = \mathbb{E}_{q(x, z, t)} \left[ \| v_\theta(x_t, t) - (z - x) \|_2^2 \right],$ where $q(x, z, t) = q(x, z)\, \delta(x_t - (1-t)x - t z) \, p(t)$ and $p(t)$ is typically uniform or logit-normal. The statistical structure of $q(x,z)$ has first-order impacts on speed of convergence, stability, and variance of the regression target. Standard FM uses an independent coupling, while alternative adaptive couplings such as OT have shown improvements.
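
The effect of the coupling on the regression target $z - x$ can be seen in one dimension, where the OT coupling for squared cost is obtained simply by pairing sorted samples. The following is a minimal NumPy sketch under stated assumptions: the "data" distribution, the sample size, and the helper name `mean_sq_target` are illustrative and do not come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(loc=3.0, scale=0.5, size=n)   # stand-in 1-D "data" samples (assumed)
z = rng.normal(size=n)                       # Gaussian noise samples

def mean_sq_target(x, z):
    """Average squared magnitude of the FM regression target z - x."""
    return np.mean((z - x) ** 2)

# Independent coupling: pair samples in the order they were drawn.
cost_independent = mean_sq_target(x, z)

# 1-D OT coupling: pairing sorted samples minimizes the mean squared difference.
cost_ot = mean_sq_target(np.sort(x), np.sort(z))

print(f"independent: {cost_independent:.2f}, sorted (OT): {cost_ot:.2f}")
```

Both pairings have the same marginals, yet the sorted pairing gives shorter and more predictable targets; this is the sense in which the choice of $q(x, z)$ shapes the regression problem.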

2. Distilling Couplings from Normalizing Flow Teachers

A normalizing flow (NF), $f_{\rm NF}$, is an invertible map ($\mathbb{R}^n \to \mathbb{R}^n$) learned via maximum likelihood. Its bijectivity allows the construction of a quasi-deterministic coupling:

Slightly noised data, $x' = x + \eta\epsilon'$ with $\epsilon' \sim \mathcal{N}(0, I)$, are mapped through the NF.
The forward map yields $\tilde{z} = f_{\rm NF}(x', c)$, where $c$ is the conditioning (class) input, and the result is normalized by the empirical pre-image standard deviation $\sigma_f$: $z_{\epsilon'} = \frac{1}{\sigma_f} f_{\rm NF}(x + \eta\epsilon', c).$
The coupling density is defined as: $q_{\rm NF}(x, z) = p_{\rm data}(x)\int \delta(z - z_{\epsilon'}) \mathcal{N}(\epsilon')\, d\epsilon'.$

This coupling is substituted into the FM loss in place of the usual independent coupling $p(x)p(z)$ or a numerically expensive OT coupling. The distilled NFM objective becomes: $\mathcal{L}_{\rm NFM}(\theta) = \mathbb{E}_{x, \epsilon', t}\left[ \| v_\theta(x_t, c, t) - (z_{\epsilon'} - x) \|_2^2 \right], \quad x_t = (1-t)x + t z_{\epsilon'}.$ This enforces FM with a highly structured and nearly deterministic endpoint matching.
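
The construction of $z_{\epsilon'}$ can be sketched with a toy teacher. In the NumPy example below, `toy_nf_forward` is a hypothetical elementwise affine map standing in for a real pretrained NF, the conditioning input $c$ is omitted, and estimating $\sigma_f$ as the standard deviation of the teacher outputs on noised data is an assumption about how the normalization is computed.

```python
import numpy as np

def toy_nf_forward(x, scale=2.0, shift=-1.0):
    """Hypothetical stand-in for f_NF: an invertible elementwise affine map."""
    return scale * x + shift

def nf_coupling(x, eta, sigma_f, rng):
    """Return z_eps = f_NF(x + eta * eps) / sigma_f for a batch of data x."""
    eps = rng.normal(size=x.shape)           # eps' ~ N(0, I)
    return toy_nf_forward(x + eta * eps) / sigma_f

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=0.5, size=(4096, 2))   # stand-in "data" (assumed)
eta = 0.05

# Assumed estimator: std of the teacher outputs over a noised batch.
sigma_f = toy_nf_forward(x + eta * rng.normal(size=x.shape)).std()

z = nf_coupling(x, eta, sigma_f, rng)
corr = np.corrcoef(x[:, 0], z[:, 0])[0, 1]
print(f"std(z) = {z.std():.3f}, corr(x, z) = {corr:.3f}")
```

With a small $\eta$, each $x$ is mapped to an almost unique $z$, so the pairs are strongly dependent, unlike the independent coupling where $x$ and $z$ are uncorrelated.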

3. Key Equations, Training, and Inference Procedures

The essential equations underpinning FM and NFM are as follows:

Generic path-wise FM loss: $\mathcal{L}_{\nabla}(\theta) = \mathbb{E}_{q(x, z), p(t)} \left\| v_\theta((1-t)x + t z, t) - (z - x) \right\|_2^2$
NF-induced target: $z_{\epsilon'} = \frac{1}{\sigma_f} f_{\rm NF}(x + \eta \epsilon', c)$
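NFM distilled loss: $\mathcal{L}_{\rm NFM}(\theta) = \mathbb{E}_{x, \epsilon', t} \left[ \left\| g_\theta(x_t, c, t) - (z_{\epsilon'} - x) \right\|_2^2 \right]$, where $g_\theta$ denotes the student velocity network (written $v_\theta$ in Section 2) and $x_t = (1-t)x + t z_{\epsilon'}$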

Training pseudocode:

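while not converged:
    x, c = sample_data_minibatch()               # data and class labels
    ε′ = sample_noise()                          # ε′ ~ N(0, I)
    z = (1/σ_f) * f_NF(x + η * ε′, c)            # NF-induced endpoint z_{ε′}
    t = sample_time()                            # t ~ p(t)
    x_t = (1-t) * x + t * z                      # interpolant between x and z
    v_target = z - x                             # regression target
    v_pred = g_θ(x_t, c, t)                      # student prediction
    loss = mean_squared_error(v_pred, v_target)
    θ = update_params(θ, loss)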
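
The training loop can be exercised end to end on a toy problem. The following NumPy sketch makes several assumptions that are not in the source: `f_nf` is a hypothetical affine teacher, the student is a linear model on the features $(x_t, t, 1)$ rather than a Transformer, plain gradient descent replaces Adam, $p(t)$ is uniform, and the conditioning $c$ is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma_f = 0.05, 1.0   # the toy teacher's outputs already have roughly unit std

def f_nf(x):
    """Hypothetical invertible teacher (not a real normalizing flow)."""
    return 2.0 * x - 1.0

def features(x_t, t):
    """Inputs of the linear student: [x_t, t, 1]."""
    return np.concatenate([x_t, t, np.ones_like(t)], axis=1)

dim, batch, lr = 2, 256, 0.05
W = np.zeros((dim + 2, dim))             # student parameters (theta)
losses = []

for step in range(500):
    x = rng.normal(loc=0.5, scale=0.5, size=(batch, dim))   # data minibatch
    eps = rng.normal(size=x.shape)
    z = f_nf(x + eta * eps) / sigma_f                       # NF-induced endpoint
    t = rng.uniform(size=(batch, 1))                        # time sample
    x_t = (1.0 - t) * x + t * z                             # interpolant
    v_target = z - x                                        # regression target

    phi = features(x_t, t)
    residual = phi @ W - v_target
    losses.append(np.mean(np.sum(residual ** 2, axis=1)))   # squared-error loss
    grad = 2.0 * phi.T @ residual / batch                   # exact gradient w.r.t. W
    W -= lr * grad                                          # gradient-descent update

print(f"first loss: {losses[0]:.3f}, mean of last 50: {np.mean(losses[-50:]):.3f}")
```

Even this linear student reduces the loss substantially, because under a near-deterministic coupling the target $z - x$ is almost a function of $(x_t, t)$.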

Sampling pseudocode (Euler or Heun solver):

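x_S = sample_standard_normal()            # start from noise at t = 1
for k in S,..,1:
    t = k/S      # uniform grid; a t^2 schedule would need non-uniform Δt
    Δt = 1/S
    x_{k-1} = x_k - Δt * g_θ(x_k, c, t)   # Euler step toward t = 0
return x_0                                # generated sample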
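
Both solvers can be checked against a case with a known answer. In the NumPy sketch below, `velocity` is the exact velocity field of a hypothetical deterministic affine coupling $z = a x + b$ (an assumption standing in for the learned $g_\theta$), so integrating from $t=1$ to $t=0$ should return $x = (z - b)/a$. The function name `sample` and its arguments are illustrative.

```python
import numpy as np

def velocity(x_t, t, a=2.0, b=-1.0):
    """Exact velocity z - x for the hypothetical affine coupling z = a*x + b."""
    x0 = (x_t - t * b) / (1.0 + (a - 1.0) * t)   # recover the data endpoint
    return (a - 1.0) * x0 + b

def sample(z, num_steps, method="euler"):
    """Integrate dx/dt = velocity from t = 1 (noise) down to t = 0 (data)."""
    x = np.array(z, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps, 0, -1):
        t = k / num_steps
        v = velocity(x, t)
        if method == "heun":
            x_pred = x - dt * v                  # Euler predictor
            v_next = velocity(x_pred, t - dt)    # velocity at the predicted point
            v = 0.5 * (v + v_next)               # trapezoidal average
        x = x - dt * v
    return x

z = np.random.default_rng(0).normal(size=(5, 2))
x_euler = sample(z, num_steps=5, method="euler")
x_heun = sample(z, num_steps=16, method="heun")
print(np.max(np.abs(x_euler - (z + 1.0) / 2.0)))
```

Because the trajectories of this toy field are exactly straight, even a few Euler steps recover the endpoint to floating-point precision, which illustrates why low curvature matters for few-step sampling. Heun costs two function evaluations per step.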

4. Model Architectures and Hyperparameters

The teacher AR-NF model (“TarFlow”) uses Transformer-based autoregressive coupling layers (“meta-blocks”):

Configuration example: TF-6×2+2×26-1024/2 (6 meta-blocks depth 2, 2 meta-blocks depth 26, 1024 dims, patch size 2).
Parameter count: ≈0.82B.

The student FM network $g$ :

Shares the patch embedding and Transformer block configuration with the teacher (SiT-XL/4 style), but has no invertibility constraint.
Embeds time and class information via learned embeddings.
Parameter count: ≈0.8B.

Training settings:

Datasets: ImageNet-64, ImageNet-256 (with 256×256 images embedded via pretrained VAE).
Teacher noise $\eta$ : 0.05 (64×64), 0.10 (256×256).
Teacher trained on 512M samples (~420 epochs), student and FM baselines on 256M (210 epochs).
Optimizer: Adam ( $\beta_1=0.9$ , $\beta_2=0.999$ ), learning rate ≈ $4\times10^{-4}$ .
Batch size: 1024.
Time sampler: logit-normal ($a=-0.2$, $b=1$).
Label dropout: $p=0.1$ .
Sampler: Euler for NFE $\leq$ 5, Heun otherwise.
Guidance: classifier-free, with the guidance scale tuned by golden-section search on FID and then held fixed.
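
Two of these settings, the logit-normal time sampler and label dropout, can be sketched in a few lines. The NumPy example below assumes that $a$ and $b$ are the location and scale of the underlying normal variable, so that $t = \mathrm{sigmoid}(a + b\,n)$ with $n \sim \mathcal{N}(0, 1)$; the source does not spell out this parameterization. The function names, the 1000-class label range, and the `null_label` token are hypothetical.

```python
import numpy as np

def sample_logit_normal(rng, size, a=-0.2, b=1.0):
    """Draw t in (0, 1) as sigmoid(a + b * n) with n ~ N(0, 1) (assumed form)."""
    n = rng.normal(size=size)
    return 1.0 / (1.0 + np.exp(-(a + b * n)))

def drop_labels(rng, labels, null_label, p=0.1):
    """Replace each label by a null token with probability p."""
    mask = rng.uniform(size=labels.shape) < p
    return np.where(mask, null_label, labels)

rng = np.random.default_rng(0)
t = sample_logit_normal(rng, size=100_000)
labels = rng.integers(0, 1000, size=100_000)          # class ids 0..999 (assumed)
dropped = drop_labels(rng, labels, null_label=1000)   # 1000 marks "no label"

print(f"median t = {np.median(t):.3f}, "
      f"dropped fraction = {np.mean(dropped == 1000):.3f}")
```

Under this parameterization the median of $t$ is $\mathrm{sigmoid}(-0.2) \approx 0.45$, so sampling is concentrated around intermediate times. Training on a fraction of unlabeled examples is what lets the same network provide the unconditional prediction needed for classifier-free guidance.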

5. Comparative Evaluation and Ablation Studies

Experimental comparisons on ImageNet-64 (class-conditional FM):

Number of function evaluations (NFE, solver)	Independent FM FID	OT-FM (semi-discrete) FID	NFM FID
31 (Heun)	2.57	2.68	1.78
15 (Heun)	4.80	3.15	2.15
7 (Heun)	13.01	6.41	3.23
5 (Euler)	17.56	9.29	4.01
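
The relative gap between independent FM and NFM can be read off the table with a short calculation. The snippet below only divides the FID values listed above; the dictionary layout and variable names are illustrative.

```python
# FID values copied from the ImageNet-64 table above.
fid = {
    "31 (Heun)": {"independent": 2.57, "ot": 2.68, "nfm": 1.78},
    "15 (Heun)": {"independent": 4.80, "ot": 3.15, "nfm": 2.15},
    "7 (Heun)": {"independent": 13.01, "ot": 6.41, "nfm": 3.23},
    "5 (Euler)": {"independent": 17.56, "ot": 9.29, "nfm": 4.01},
}

# Ratio of independent-FM FID to NFM FID at each evaluation budget.
ratios = {nfe: round(row["independent"] / row["nfm"], 2) for nfe, row in fid.items()}

for nfe, ratio in ratios.items():
    print(f"NFE {nfe}: independent FM FID is {ratio}x the NFM FID")
```

The ratio grows from about 1.4 at 31 NFE to about 4.4 at 5 NFE, so the advantage of the NF coupling is largest when few function evaluations are available.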

On ImageNet-256 (NFE = 31, 15, 7), NFM achieves lower FID than both the independent-coupling baseline and OT-FM, with up to a $5\times$ improvement in low-NFE regimes. On ImageNet-64, the student also surpasses its AR-NF teacher (FID 1.78 versus 1.98).

Curvature metrics (lower $\kappa$ is straighter):

Solver (NFE)	FM $\kappa$	OT-FM $\kappa$	NFM $\kappa$
Heun (31)	0.0864	0.0767	0.0435
Euler (128)	0.0386	0.0289	0.0181

At 31 NFE, NFM achieves a $32\times$ reduction in latency over the AR-NF teacher while maintaining or improving sample quality.

Ablations:

Teacher NLL performance linearly correlates with student FID.
The $\eta$ that gives the best teacher FID also gives the best student FID; large $\eta$ ($> 0.2$) harms both.
NF latent $z$ -space exhibits high non-isometry, but couplings remain effective.
Qualitative samples show NFM behaves closer to distribution matching than strict pair-wise distillation.

6. Implications and Practical Significance

Normalized Flow Matching demonstrates that leveraging the invertible nature of a pretrained NF model enables a direct and structured coupling for FM learning. This provides substantial improvements:

Faster training convergence (fewer iterations and lower loss).
Improved sample quality (lower FID, especially with few function evaluations).
Sampling speedup of $30$–$100\times$ over the AR-NF teacher.
Student models can surpass the FID of their teachers, suggesting the distilled coupling leads to superior learned flows (Berthelot et al., 9 Mar 2026).

A plausible implication is that NFM’s approach to coupling reuse generalizes broadly across flow-based models, opening avenues for more efficient distillation paradigms in generative modeling. Open-source code, architectures, and training configurations are provided at github.com/apple/ml-nfm for reproducibility.

References (1)

The Coupling Within: Flow Matching via Distilled Normalizing Flows (2026)
