Preconditioned Flow & Score Matching
- The paper demonstrates that coupling preconditioning maps with score and flow matching effectively alleviates optimization bottlenecks in generative models.
- It shows that applying normalizing flows to whiten data distributions reduces covariance-induced slowdowns, leading to significant improvements in metrics like FID.
- Practical guidelines include interleaving preconditioner and model updates, and pairing diffusion score matching with learned preconditioners, to achieve stable, efficient training under favorable geometric conditions.
Preconditioned flow and score matching constitute a family of generative model training frameworks that address geometric and optimization challenges arising from ill-conditioned intermediate distributions. These methods leverage the relationship between model dynamics, covariance structure, and learning efficiency, using invertible transformations—such as normalizing flows—to systematically improve convergence and final sample quality. Central to these approaches is the insight that optimization slows dramatically in low-variance directions of the data distribution, and that preconditioning the learning process can mitigate or avoid such bottlenecks altogether.
1. Theoretical Foundations: Flow Matching and Score Matching
Flow matching and score-based methods model generative processes by training vector fields or score functions to interpolate between a tractable reference distribution $p_0$ (e.g., a standard Gaussian) and a complex target distribution $p_1$ (the data). In flow matching, a model $v_\theta(x, t)$ is trained to match the ground-truth velocity field along a deterministic interpolation path $x_t = (1-t)\,x_0 + t\,x_1$ with $x_0 \sim p_0$, $x_1 \sim p_1$. The loss is

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\!\left[\,\big\| v_\theta(x_t, t) - u_t \big\|^2\,\right], \qquad u_t = x_1 - x_0,$$

where $u_t$ denotes the prescribed velocity.
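A minimal sketch of this regression objective, assuming the linear path above and an illustrative velocity-network signature `v_theta(x_t, t)` (names are placeholders, not the cited implementation):

```python
# Minimal conditional flow-matching loss for the linear interpolation path
# x_t = (1 - t) * x0 + t * x1, whose prescribed velocity is u_t = x1 - x0.
# `v_theta` is any network taking (x_t, t); names here are illustrative.
import torch

def flow_matching_loss(v_theta, x1, p0_sampler=torch.randn_like):
    x0 = p0_sampler(x1)                        # reference sample, e.g. N(0, I)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1                # point on the interpolation path
    target = x1 - x0                           # ground-truth velocity u_t given (x0, x1)
    return ((v_theta(x_t, t) - target) ** 2).mean()
```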
In score-based diffusion models, one considers a forward SDE mapping $p_{\mathrm{data}}$ to marginals $p_t$ along a noise-infused path. The score function $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ is learned with the denoising score matching loss

$$\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\,\lambda(t)\,\big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2\,\right].$$

Both frameworks reduce to solving a least-squares regression under $p_t$ at each fixed $t$; the geometric structure (specifically, the covariance $\Sigma_t$) of $p_t$ governs the optimization landscape (Ahamed et al., 2 Mar 2026).
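A corresponding sketch of denoising score matching for a Gaussian perturbation kernel $p_t(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 I)$; the schedule functions `alpha` and `sigma`, the weighting `lam`, and the network signature are illustrative assumptions:

```python
# Minimal denoising score matching loss for a Gaussian perturbation kernel
# p_t(x_t | x0) = N(alpha_t * x0, sigma_t^2 I); its conditional score is
# -(x_t - alpha_t * x0) / sigma_t^2 = -eps / sigma_t.  `s_theta`, `alpha`,
# `sigma`, and the weighting `lam` are illustrative placeholders.
import torch

def dsm_loss(s_theta, x0, alpha, sigma, lam=lambda t: 1.0):
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = alpha(t) * x0 + sigma(t) * eps
    target = -eps / sigma(t)                   # score of the perturbation kernel
    return (lam(t) * (s_theta(x_t, t) - target) ** 2).mean()
```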
2. Covariance Geometry and the Optimization Bottleneck
For linearly interpolated Gaussians, the covariance of $x_t$ is

$$\Sigma_t = (1-t)^2\, I + t^2\, \Sigma,$$

with $\Sigma$ the data covariance (taking $x_0 \sim \mathcal{N}(0, I)$). Its eigenvalues $\lambda_i(t) = (1-t)^2 + t^2 \sigma_i^2$ (where $\sigma_i^2$ are $\Sigma$'s eigenvalues) determine the learning dynamics along each direction.
The condition number $\kappa(\Sigma_t)$ increases with $t$; at early $t$, all directions are weighted nearly equally, but at late $t$ (as $t \to 1$) the smallest eigenvalues shrink toward the smallest $\sigma_i^2$ and become tiny if $\Sigma$ is ill-conditioned. Gradient descent reduces the regression error rapidly in high-variance directions, but only slowly in the suppressed modes. This produces a two-fold slowdown: both the deterministic convergence rate and the stochastic gradient noise scale poorly with ill-conditioning,

$$\text{error along mode } i \;\propto\; \big(1 - \eta\, \lambda_i(t)\big)^{k}, \qquad k_{\text{needed}} \;\sim\; \frac{1}{\eta\, \lambda_i(t)},$$

leading to suboptimal plateaus in model performance (Ahamed et al., 2 Mar 2026).
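The effect is easy to reproduce numerically; the snippet below uses synthetic eigenvalues purely to illustrate how $\kappa(\Sigma_t)$ grows toward the data condition number as $t \to 1$:

```python
# Numerical illustration (numpy): conditioning of the interpolant covariance
# Sigma_t = (1 - t)^2 I + t^2 Sigma for an ill-conditioned data covariance.
# Eigenvalues used here are synthetic, chosen only to show the trend.
import numpy as np

data_eigs = np.array([1.0, 1e-2, 1e-4])        # sigma_i^2 of the data covariance
for t in (0.0, 0.5, 0.9, 0.99):
    lam = (1 - t) ** 2 + t ** 2 * data_eigs    # eigenvalues of Sigma_t
    print(f"t={t:4.2f}  kappa(Sigma_t) = {lam.max() / lam.min():.1f}")
# kappa is ~1 near t=0 and approaches the data condition number (1e4) as t -> 1.
```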
3. Preconditioning Maps and Invertible Transformations
Preconditioning addresses the covariance-induced bottleneck by applying an invertible map $T$ to reshape the target distribution. The goal is to whiten or Gaussianize the data distribution so that $T_{\#} p_1 \approx \mathcal{N}(0, I)$, keeping $\Sigma_t$ well conditioned for all $t$.
Two practical approaches to constructing $T$ are:
- Normalizing flow preconditioner: $T$ is trained via maximum likelihood to satisfy $T_{\#} p_1 \approx \mathcal{N}(0, I)$. The generative model is trained in preconditioned space and sampling proceeds by inverting $T$.
- Low-capacity flow preconditioner: $T$ is fit by flow matching between the reference and target distributions, offering a lightweight alternative with less modeling capacity (Ahamed et al., 2 Mar 2026).
In both cases, the overall generative model family (the learned flow composed with $T^{-1}$) is unchanged, but optimization proceeds under substantially improved geometric conditions.
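The simplest instance of this idea is an affine whitening map $T(x) = \Sigma^{-1/2}(x - \mu)$; the cited work uses learned normalizing flows, but the linear special case below already shows the conditioning effect:

```python
# Simplest illustrative preconditioner: an affine whitening map T(x) = W (x - mu)
# with W = Sigma^{-1/2}.  A learned normalizing flow plays this role in the cited
# work; the linear case here only demonstrates the effect on Sigma_t's conditioning.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(10_000, 3)) * np.array([1.0, 0.1, 0.01])   # ill-conditioned data
mu, Sigma = x1.mean(axis=0), np.cov(x1, rowvar=False)

evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T                      # Sigma^{-1/2}
T = lambda x: (x - mu) @ W                                        # whitening map (W symmetric)
T_inv = lambda z: z @ np.linalg.inv(W) + mu                       # pulls samples back to data space

z1 = T(x1)                                                        # preconditioned data, Cov(z1) ~ I
for t in (0.5, 0.99):
    lam_raw = (1 - t) ** 2 + t ** 2 * np.linalg.eigvalsh(Sigma)
    lam_pc = (1 - t) ** 2 + t ** 2 * np.linalg.eigvalsh(np.cov(z1, rowvar=False))
    print(f"t={t}: kappa without PC = {lam_raw.max() / lam_raw.min():.1f}, "
          f"with PC = {lam_pc.max() / lam_pc.min():.2f}")
```

After whitening, the interpolant's eigenvalues are nearly uniform in every direction, so the late-time conditioning blow-up disappears.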
4. Preconditioned Score Matching and Diffusion Score Matching
Diffusion Score Matching (DSM) generalizes Hyvärinen's score matching by introducing a diffusion/preconditioning matrix $D(x)$. The DSM loss is formally

$$\mathcal{L}_{\mathrm{DSM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{p_{\mathrm{data}}}\!\left[\, \big\| D(x)^{\top}\big( s_\theta(x) - \nabla_x \log p_{\mathrm{data}}(x) \big) \big\|^2 \,\right].$$

It has been established that DSM using $D(x) = \big(\nabla_x T(x)\big)^{-1}$, with $T$ an invertible flow, is exactly ordinary score matching in the latent space $z = T(x)$:

$$\mathcal{L}_{\mathrm{DSM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{p_Z}\!\left[\, \big\| s_\theta^{Z}(z) - \nabla_z \log p_Z(z) \big\|^2 \,\right], \qquad p_Z = T_{\#}\, p_{\mathrm{data}},$$

where $s_\theta^{Z}$ is the model score pushed forward to the latent space. Thus, DSM with flow-induced preconditioning transforms the problem to one with more favorable geometry, and $T$ can be learned to optimize convergence (Gong et al., 2021).
Furthermore, this preconditioning can be interpreted geometrically as introducing a position-dependent Riemannian metric determined by $D(x)$ and computing the Fisher divergence on the induced manifold.
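The latent-space equivalence can be checked numerically for an affine map $T(x) = Wx$ and Gaussian data, where both sides are available in closed form; the linear model score below is an arbitrary stand-in used only for the check:

```python
# Sanity check: for an affine map T(x) = W x, DSM with D(x) = W^{-1} evaluated on
# data samples matches ordinary score matching in latent space z = T(x).
# All scores are in closed form here (Gaussian data, arbitrary linear model score).
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d)) + 3.0 * np.eye(d)        # invertible preconditioner Jacobian
Winv = np.linalg.inv(W)
Sigma = np.diag([1.0, 0.5, 0.1, 0.01])               # ill-conditioned data covariance
x = rng.multivariate_normal(np.zeros(d), Sigma, size=5000)

P = np.diag([2.0, 1.0, 0.5, 0.2])                    # arbitrary model score s_X(x) = -P x
s_X = lambda x: -x @ P
score_pX = lambda x: -x @ np.linalg.inv(Sigma)       # true score of N(0, Sigma) (row vectors)

# DSM in data space with D = W^{-1}: per-sample term ||W^{-T} (s_X - grad log p_X)||^2.
dsm = 0.5 * np.mean(np.sum(((s_X(x) - score_pX(x)) @ Winv) ** 2, axis=1))

# Ordinary score matching in latent space z = W x, with the model score pushed forward.
z = x @ W.T
s_Z = lambda z: s_X(z @ Winv.T) @ Winv               # covector (score) transform under T
score_pZ = lambda z: -z @ np.linalg.inv(W @ Sigma @ W.T)
sm_latent = 0.5 * np.mean(np.sum((s_Z(z) - score_pZ(z)) ** 2, axis=1))

print(f"DSM in data space: {dsm:.6f}   SM in latent space: {sm_latent:.6f}")  # identical up to fp error
```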
5. Algorithmic Implementation and Practical Guidelines
The preconditioned flow matching algorithm interleaves updates to the preconditioning map and the flow (or score) model. A typical procedure includes the following steps (a code sketch follows the list):
- Optionally training $T$ to whiten data samples via maximum likelihood.
- Sampling $x_0 \sim p_0$, $x_1 \sim p_1$, and a time $t$.
- Mapping the data sample to the preconditioned space via $T$ and forming the interpolant there.
- Evaluating regression targets and losses in preconditioned space.
- Updating model and preconditioner parameters (Ahamed et al., 2 Mar 2026).
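A minimal training-step sketch, under the assumptions that `flow_T` exposes `log_prob`, `transform`, and `inverse` methods and that `v_theta` is a velocity network; the exact update schedule in the cited work may differ:

```python
# Interleaved preconditioner / flow-matching update (illustrative API and schedule).
import torch

def train_step(flow_T, v_theta, opt_T, opt_v, x1):
    # (1) Preconditioner update: maximum likelihood so that T(x1) ~ N(0, I).
    opt_T.zero_grad()
    nll = -flow_T.log_prob(x1).mean()
    nll.backward()
    opt_T.step()

    # (2) Flow-matching update in the preconditioned space.
    opt_v.zero_grad()
    with torch.no_grad():
        z1 = flow_T.transform(x1)              # map data to preconditioned space
    z0 = torch.randn_like(z1)
    t = torch.rand(z1.shape[0], *([1] * (z1.dim() - 1)), device=z1.device)
    z_t = (1 - t) * z0 + t * z1
    loss = ((v_theta(z_t, t) - (z1 - z0)) ** 2).mean()
    loss.backward()
    opt_v.step()
    return loss.item()

# Sampling: integrate v_theta from z0 ~ N(0, I) to z1, then return flow_T.inverse(z1).
```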
Score matching via DSM or in EDM-style preconditioned denoising regression benefits from time-dependent normalization of inputs, targets, and loss weighting to enforce uniform optimization properties across noise levels $t$ (Yang et al., 11 Dec 2025). Preconditioner architectural choices impact both conditioning and computational cost (e.g., coupling-layer NFs for tractable Jacobians; small MLPs for latent domains).
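As a concrete example of such time-dependent normalization, Karras-style (EDM) scalings of inputs, targets, and loss weights can be sketched as below; this is a generic denoiser preconditioning, and the speech-enhancement objective in the cited work adapts the idea to its own parameterization:

```python
# EDM-style preconditioning of a denoiser: noise-dependent scaling of the network
# input, output, and loss weight.  `F_theta` is any network; `sigma` must be a
# tensor broadcastable against the data.  Shown as a generic example only.
import torch

def edm_denoiser(F_theta, x_noisy, sigma, sigma_data=0.5):
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    c_out = sigma * sigma_data / (sigma ** 2 + sigma_data ** 2).sqrt()
    c_in = 1.0 / (sigma ** 2 + sigma_data ** 2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)

def edm_loss(F_theta, x_clean, sigma, sigma_data=0.5):
    noise = torch.randn_like(x_clean) * sigma
    denoised = edm_denoiser(F_theta, x_clean + noise, sigma, sigma_data)
    weight = (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2   # 1 / c_out^2
    return (weight * (denoised - x_clean) ** 2).mean()
```

The scalings keep the effective regression target at unit scale for every noise level, which is the "uniform optimization properties across $t$" mentioned above.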
6. Empirical Evaluation and Impact
Preconditioning yields substantial empirical gains across domains:
- On MNIST latent space (via a VAE), normalizing-flow preconditioning markedly reduces the condition number $\kappa(\Sigma_t)$ across $t$ and improves FID relative to training without preconditioning (no PC vs. NF-PC) (Ahamed et al., 2 Mar 2026).
- On high-resolution image datasets (LSUN Churches, Oxford Flowers-102, AFHQ Cats), preconditioned flows achieve better-conditioned intermediate distributions and improved FID, while also eliminating blur and repeated-patterning artifacts (Ahamed et al., 2 Mar 2026).
- In speech enhancement with flow matching, EDM-style preconditioning of the prediction target improves convergence speed (2× faster), stabilizes learning, and achieves the best or joint-best performance on PESQ and SI-SDR relative to baseline and unpreconditioned objectives (Yang et al., 11 Dec 2025).
Key diagnostic and practical recommendations include tracking $\kappa(\Sigma_t)$ during training to monitor emergent ill-conditioning, initializing with simple latent-space preconditioners, and combining preconditioning with loss reweighting or adaptive optimizers.
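A lightweight way to implement the first recommendation is to estimate the interpolant's covariance conditioning on mini-batches at a few times $t$; the times below are illustrative, and the batch should be at least as large as the data dimension for the estimate to be meaningful:

```python
# Mini-batch diagnostic: condition number of the interpolant covariance at a few
# times t.  Rising values over training signal emerging ill-conditioning.
import torch

def interpolant_condition_numbers(x1_batch, ts=(0.5, 0.9, 0.99)):
    x1 = x1_batch.flatten(1)                        # (batch, dim)
    x0 = torch.randn_like(x1)
    out = {}
    for t in ts:
        x_t = (1 - t) * x0 + t * x1
        cov = torch.cov(x_t.T)                      # (dim, dim) sample covariance
        eig = torch.linalg.eigvalsh(cov)
        out[t] = (eig.max() / eig.clamp_min(1e-12).min()).item()
    return out
```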
7. Connections to Broader Frameworks and Theoretical Unification
The principles behind preconditioned flow and score matching extend to Minimum Probability Flow (MPF) (0906.4779), which frames learning as minimizing the instantaneous KL rate out of the data distribution under prescribed dynamics. For continuous-state Gaussian flows, MPF reduces to (possibly preconditioned) score matching, with explicit analytic connection through the infinitesimal time limit. MPF, DSM, and preconditioned flow matching all leverage user- or data-driven design of dynamics, connectivity, or metric structure to enhance efficiency and stability of model fitting.
Collectively, these developments establish that preconditioning—originating in stochastic optimization—enables a principled, data-adaptive solution to optimization obstacles in generative modeling, providing robust convergence and consistently improved sample fidelity across domains.