Flow Matching Pre-Training

Updated 20 May 2026

Flow Matching Pre-Training is a generative modeling strategy that learns a time-dependent velocity field to transport simple distributions to complex data distributions.
It employs pre-training methods such as latent variable modeling, geometric matching, and diffusion alignment to reduce training variance and computational cost.
Empirical applications in image, speech, and physics domains show improved efficiency, enhanced fidelity metrics, and accelerated convergence in high-dimensional settings.

Flow Matching Pre-Training refers to a class of generative modeling approaches in which the objective is to learn a time-dependent vector field (velocity field) whose associated trajectories (flows) transport samples from a simple source distribution (typically isotropic Gaussian noise) to a complex data distribution. Pre-training in this context denotes the use of auxiliary tasks or models—such as variational autoencoders (VAEs), geometric matching, or stochastic consistency objectives—to initialize or guide the flow-matching network, often to accelerate convergence, improve robustness, or encode data-structure priors. The paradigm has been applied to image, speech, physics-informed, and representation learning domains, and encompasses diverse strategies such as latent variable modeling, direct flow alignment from pre-trained diffusion models, and explicit low-variance estimators for the target velocity field.

1. Foundations of Flow Matching Pre-Training

Flow matching, as introduced by Lipman et al., formalizes generative modeling as learning a velocity field $v_\theta(x,t)$ that defines an ordinary differential equation (ODE) whose solution maps a source distribution $p_0$ (e.g., $\mathcal{N}(0,I)$) to a data distribution $p_1$ over the interval $t\in[0,1]$. In practice, the learned flows are parameterized by deep neural networks and trained to minimize a pathwise squared error between $v_\theta$ and an analytically constructed "true" velocity field along stochastic or deterministic interpolation paths between $p_0$ and $p_1$.
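As a minimal sketch of the regression objective described above, the following assumes the straight-line path $x_t = (1-t)\,x_0 + t\,x_1$, which is one common choice of interpolation and not the only one. Under that assumption the target velocity is simply $x_1 - x_0$. The names `cfm_batch`, `cfm_loss`, and the toy linear `velocity_model` are illustrative, and NumPy stands in for a deep-learning framework.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Sample (x_t, t, target) for conditional flow matching on a linear path."""
    x0 = rng.standard_normal(x1.shape)       # source samples from N(0, I)
    t = rng.uniform(size=(x1.shape[0], 1))   # one time per example, t ~ U[0, 1]
    xt = (1.0 - t) * x0 + t * x1             # point on the straight path
    target = x1 - x0                         # d x_t / dt along that path
    return xt, t, target

def cfm_loss(velocity_model, x1, rng):
    """Mean squared error between predicted and target velocities."""
    xt, t, target = cfm_batch(x1, rng)
    pred = velocity_model(xt, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

# Toy stand-in for a neural network (assumption): a fixed linear map of [x_t, t].
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2)) * 0.1

def velocity_model(xt, t):
    return np.concatenate([xt, t], axis=1) @ W

x1 = rng.standard_normal((256, 2)) + 3.0     # toy "data" batch
loss = cfm_loss(velocity_model, x1, rng)
```

In a real setup `velocity_model` would be a trainable network and `loss` would be minimized by stochastic gradient descent; the sampling logic stays the same.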

Pre-training in the flow matching context often involves one or more of the following:

Learning data manifold structure or feature representations via latent variable models (Samaddar et al., 7 May 2025).
Leveraging geometric or masked modeling objectives for improved feature extraction (Dong et al., 2023, Weinzaepfel et al., 2022).
Distilling representations or velocity fields from pre-trained diffusion models or consistency flows (Schusterbauer et al., 2 Jun 2025, Boffi et al., 2024).
Utilizing explicit, tractable estimators for the optimal vector field, thereby reducing training variance (Ryzhakov et al., 2024).

The goal of pre-training is to decouple difficult density estimation from flow learning, lower sample or computational complexity, and transfer inductive biases, often yielding substantial improvements in both efficiency and sample quality.

2. Latent Variable Pre-Training Methods

A prominent instance is the Latent-CFM framework, which first fits a deep latent variable model (VAE or GMM) to the data distribution:

$p_1(x_1) = \int p(f)\,p_\psi(x_1|f)\,df, \quad p(f)=\mathcal{N}(0,I)$

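where $f$ is a low-dimensional latent. In the VAE case, an encoder $q_\phi(f \mid x_1)$ and a decoder $p_\psi(x_1 \mid f)$ are trained to maximize an ELBO. After VAE convergence, all layers but the last are frozen, and only the final encoder layer is updated during flow-matching fine-tuning.

The flow-matching objective is conditioned on both data and latent:

$\mathcal{L}_{\mathrm{FM}}(\theta,\phi) = \mathbb{E}_{t,\,x_0,\,x_1,\,f\sim q_\phi(f \mid x_1)}\,\big\| v_\theta(x_t, f, t) - u_t(x_t \mid x_0, x_1) \big\|^2$

where $x_t$ lies on the interpolation path between $x_0 \sim p_0$ and $x_1 \sim p_1$ and $u_t$ is the corresponding conditional target velocity, with an additional KL regularization on the encoder posterior. Crucially, this scheme enables the velocity field $v_\theta$ to focus on structured variations encoded in $f$, significantly reducing the sample and optimization complexity.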
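The latent-conditioned objective with its KL regularization can be sketched as follows. This is a hedged illustration rather than the Latent-CFM authors' implementation. It assumes a diagonal-Gaussian encoder posterior, a standard-normal latent prior, and the straight-line path between noise and data. The names `encode`, `velocity`, and the weight `beta` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per example."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)

def latent_cfm_loss(encode, velocity, x1, rng, beta=1e-3):
    """Flow-matching regression conditioned on a sampled latent, plus a KL penalty."""
    mu, logvar = encode(x1)                        # encoder posterior parameters
    noise = rng.standard_normal(mu.shape)
    f = mu + np.exp(0.5 * logvar) * noise          # reparameterized latent sample
    x0 = rng.standard_normal(x1.shape)             # source sample from N(0, I)
    t = rng.uniform(size=(x1.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1                   # assumed linear path
    fm = np.sum((velocity(xt, f, t) - (x1 - x0)) ** 2, axis=1)
    return float(np.mean(fm + beta * gaussian_kl(mu, logvar)))

# Toy linear stand-ins for the pre-trained encoder and the velocity network.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2)) * 0.1              # data dim 4 -> latent dim 2
B = rng.standard_normal((4 + 2 + 1, 4)) * 0.1

def encode(x):
    return x @ A, np.zeros((x.shape[0], 2))        # (mean, log-variance)

def velocity(xt, f, t):
    return np.concatenate([xt, f, t], axis=1) @ B

x1 = rng.standard_normal((128, 4))
loss = latent_cfm_loss(encode, velocity, x1, rng)
```

In the setting described above, `encode` would be initialized from the converged VAE, with only its final layer left trainable.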

Empirical results show Latent-CFM achieves similar or better FID scores on image benchmarks (MNIST, CIFAR-10) and physical field generation (Darcy flow) with roughly half the number of gradient steps required by unconditional flow matching (Samaddar et al., 7 May 2025).

3. Alignment with Diffusion and Consistency Models

Pre-training strategies have been devised to align pre-trained diffusion models and flow-matching networks. Diff2Flow (Schusterbauer et al., 2 Jun 2025) proposes a systematic conversion:

Timesteps from discrete diffusion are mapped to the continuous flow-matching interval via a rescaling function.
Interpolants and noisy paths are realigned, and diffusion model predictions (e.g., $\epsilon$- or $v$-parameterized outputs) are analytically transformed into FM-compatible velocity fields.
Fine-tuning involves optimizing the standard FM loss, with or without parameter-efficient adapters (LoRA), directly leveraging the prior knowledge stored in large diffusion backbones.
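To make the conversion steps above concrete, here is a minimal sketch under stated assumptions; it is not reproduced from the Diff2Flow paper. It assumes a diffusion forward process $x_t = \alpha_t\,\text{data} + \sigma_t\,\text{noise}$, a flow-matching path $x_s = s\,\text{data} + (1-s)\,\text{noise}$ with noise at $s=0$, and a noise-predicting ($\epsilon$) network. Dividing the diffusion state by $\alpha_t + \sigma_t$ places it on the flow-matching path at time $s = \alpha_t / (\alpha_t + \sigma_t)$. The function name `diffusion_to_fm` is hypothetical.

```python
import numpy as np

def diffusion_to_fm(x_t, eps_hat, alpha, sigma):
    """Map a diffusion state and noise prediction to flow-matching coordinates.

    Assumes x_t = alpha * data + sigma * noise (diffusion) and
    x_s = s * data + (1 - s) * noise (flow matching, noise at s = 0).
    """
    scale = alpha + sigma
    x_s = x_t / scale                            # rescaled state lies on the FM path
    s = alpha / scale                            # matching FM time in [0, 1]
    data_hat = (x_t - sigma * eps_hat) / alpha   # implied clean-sample estimate
    v_hat = data_hat - eps_hat                   # FM velocity d x_s / d s = data - noise
    return x_s, s, v_hat

rng = np.random.default_rng(0)
data = rng.standard_normal((5, 3))
noise = rng.standard_normal((5, 3))
alpha, sigma = np.cos(0.4), np.sin(0.4)          # any pair with alpha^2 + sigma^2 = 1
x_t = alpha * data + sigma * noise
# Use the true noise as an "oracle" prediction to check the algebra.
x_s, s, v_hat = diffusion_to_fm(x_t, noise, alpha, sigma)
```

With an exact noise prediction the recovered velocity equals `data - noise`. The division by `alpha` becomes ill-conditioned near pure noise, so a practical implementation would need to treat that end of the schedule with care.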

This approach enables data-efficient transfer to FM objectives, achieves competitive or superior FID and downstream task metrics (e.g. depth estimation, text-to-image), and admits inference at substantially reduced numerical cost (as few as two to four steps) compared to diffusion baselines.

Similarly, Flow Map Matching (FMM) (Boffi et al., 2024) unifies flow-matching, consistency models, and distillation into a single mathematical framework using stochastic interpolants and two-time flow maps, permitting direct or teacher-driven pre-training to obtain high-fidelity one-shot or few-step samplers.

4. Explicit Loss Estimators and Low-Variance Training

Explicit Flow Matching (ExFM) (Ryzhakov et al., 2024) provides a theoretically grounded, gradient-equivalent alternative to standard conditional flow matching losses by analytically averaging the target vector field over the data endpoint distribution:

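$\mathcal{L}_{\mathrm{ExFM}}(\theta) = \mathbb{E}_{t,\,x_t}\,\Big\| v_\theta(x_t,t) - \int u_t(x_t \mid x_1)\,\rho(x_1 \mid x_t, t)\,dx_1 \Big\|^2$

where $u_t(x_t \mid x_1)$ is the velocity corresponding to the interpolant, and $\rho(x_1 \mid x_t, t)$ is the conditional endpoint density. This integral is estimated by Monte Carlo averaging over a larger number of data samples. ExFM yields substantially lower gradient variance, accelerates convergence, and produces optimal vector fields that can be characterized in closed form for certain distributions.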
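The averaged target can be illustrated with a self-normalized Monte Carlo estimate over a batch of data points. This is a sketch under assumptions, not the ExFM reference code. It assumes the linear interpolant $x_t = (1-t)\,x_0 + t\,x_1$ with $x_0 \sim \mathcal{N}(0,I)$, so that $x_t$ given $x_1$ is Gaussian with mean $t\,x_1$ and variance $(1-t)^2 I$, and the conditional velocity is $(x_1 - x_t)/(1-t)$. The function name `averaged_target` is hypothetical.

```python
import numpy as np

def averaged_target(x_t, t, data):
    """Posterior-weighted average of conditional velocities at (x_t, t), t < 1.

    Weights are proportional to the Gaussian likelihood of x_t given each
    candidate endpoint, i.e. N(x_t; t * x1_j, (1 - t)^2 I).
    """
    sq = np.sum((x_t[None, :] - t * data) ** 2, axis=1)   # ||x_t - t * x1_j||^2
    logw = -sq / (2.0 * (1.0 - t) ** 2)
    w = np.exp(logw - logw.max())                         # numerically stable softmax
    w /= w.sum()
    cond_v = (data - x_t[None, :]) / (1.0 - t)            # one velocity per endpoint
    return w @ cond_v, w

rng = np.random.default_rng(0)
data = rng.standard_normal((512, 2)) + np.array([4.0, 0.0])
x0, x1, t = rng.standard_normal(2), data[0], 0.3
x_t = (1.0 - t) * x0 + t * x1
single_sample_target = x1 - x0                 # standard single-pair CFM target
avg_target, weights = averaged_target(x_t, t, data)
```

The single-pair target depends on which $(x_0, x_1)$ pair happened to produce $x_t$, whereas `avg_target` depends only on $x_t$, $t$, and the data batch, which is the source of the variance reduction.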

ExFM can serve as a pre-training method for large flows or as a warm-start for hybrid flow–diffusion frameworks, accelerating convergence, particularly in high-dimensional or multi-modal settings.

5. Domain-Specific Applications: Physics, Vision, Speech

Physics-Informed and Multi-Scale Pre-Training

Multi-Fidelity Flow Matching (MFFM) applies flow matching in a cascade of refinement steps between grids of increasing resolution for PDE solutions (Chen et al., 15 May 2026). Each stage is pre-trained independently, using a source distribution calibrated to empirical residuals and conditioning on low-fidelity solutions. This architecture admits one-step-per-level deterministic rollouts analogous to multigrid methods and benefits from the normalization and geometric adaptation induced by flow-matching pre-training, enabling accurate, scalable surrogate modeling of complex physical systems.

Flow-Marching (Chen et al., 23 Sep 2025) extends this principle by bridging deterministic operator learning and flow-based generative modeling, and combines this with joint VAE compression and specialized transformer architectures for large-scale PDE sequence modeling, yielding improved rollout stability and few-shot adaptation.

Vision and Optical Flow

Pre-training feature extractors on geometric matching or masked image modeling objectives, as in MatchFlow (Dong et al., 2023) and CroCo v2 (Weinzaepfel et al., 2022), establishes robust feature representations which—when fine-tuned under flow-matching or nonparametric loss—substantially reduce endpoint prediction error and improve generalization across real-world and synthetic benchmarks. Techniques such as dual-softmax loss, masked token prediction, and relative positional encodings are critical for flow-specific pre-training efficacy.

Speech and Representation Learning

In the speech domain, models such as SpeechFlow (Liu et al., 2023) pre-train transformer networks with flow-matching objectives over large unlabeled corpora, using masked conditioning and time-dependent velocity regression for high-fidelity synthesis, enhancement, and separation. Joint training of representations and velocity fields in a flow-matching objective (as in FlowFM (Ukita et al., 17 Dec 2025)) enables the simultaneous acquisition of discriminative features and generative capacity, yielding state-of-the-art recognition and rapid generation at low compute cost.

6. Implementation Protocols, Training Efficiency, and Empirical Outcomes

Key training protocols across methods include:

Minibatch sampling schemes matching $(x_0, x_1)$ pairs from base and data distributions, and time/step variables from $\mathcal{U}[0,1]$.
Architectural choices such as U-Net and Transformer backbones, often adapted for flow-matching objectives via specialized conditioning (time embeddings, FiLM layers, latent vectors).
Use of explicit regularization (e.g., $\beta$-regularized KL in VAE-based pre-training), batch normalization, and adaptive solvers for ODE integration at sample generation.

Empirical findings consistently indicate 2–4x reductions in training steps or computational cost, improved or matched sample quality (measured by FID, W2 error, NLL, downstream task accuracy), and enhanced interpretability or controllability of generation via latent traversals or conditional inputs (Samaddar et al., 7 May 2025, Ryzhakov et al., 2024, Liu et al., 2023, Ukita et al., 17 Dec 2025, Chen et al., 23 Sep 2025, Chen et al., 15 May 2026).
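Sample generation by ODE integration, mentioned in the protocols above, can be sketched with a fixed-step Euler scheme. This is the simplest possible integrator and is used here only for illustration; it is not the adaptive solver referred to in the text. The name `euler_sample` and the constant toy velocity field are assumptions.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        x = x + h * velocity(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1000, 2))     # draws from the base distribution N(0, I)
shift = np.array([3.0, -1.0])

def constant_velocity(x, t):
    # Toy field that translates every sample by `shift` over unit time.
    return np.broadcast_to(shift, x.shape)

samples = euler_sample(constant_velocity, x0)
```

For a learned, time-varying field the step count trades accuracy against cost, and an adaptive integrator such as `scipy.integrate.solve_ivp` can be substituted for the loop.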

Summary of outcomes for representative models:

Model/Domain	Principal Method	Efficiency Gain	Sample Quality Metric
Latent-CFM (images/PDE)	Latent variable pre-training	About 2x faster (roughly half the gradient steps)	FID similar or better
FlowFM (representation)	Joint encoder–FM training	[…] faster inference	HAR Acc/F1 […]
SpeechFlow (speech)	Masked FM pre-training	state-of-the-art (all tasks)	Outperforms SGMSE+
ExFM (images)	Low-variance analytic targets	[…] faster (CIFAR-10)	FID […]
Diff2Flow (T2I/depth)	Diffusion–FM alignment	[…] speedup (few steps)	FID, AbsRel competitive
MFFM/Flow-Marching (PDE)	Cascaded or latent/temporal FM	[…]–[…] efficiency	[…] error reduces […]

7. Theoretical Insights and Limits

The theoretical core of flow matching pre-training lies in the continuity equation, optimal transport, and properties of the conditional plan/interpolant between the source and data distributions. Key advances include:

Derivation of closed-form optimal drifts for Gaussian targets and mixtures (Ryzhakov et al., 2024).
Demonstration of superior error accumulation properties and rollout stability for flow-based models vs. deterministic operators (Chen et al., 23 Sep 2025).
Unified mathematical treatment of consistency, flow matching, and progressive distillation (via two-time map learning) (Boffi et al., 2024).
Variance reduction and convergence guarantees for explicit estimators.
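As a worked instance of a closed-form drift for a Gaussian target, the sketch below uses standard Gaussian conditioning; it is not reproduced from the cited papers. It assumes a source $\mathcal{N}(0,I)$, a target $\mathcal{N}(m, s^2 I)$, independent pairing of endpoints, and the linear interpolant $x_t = (1-t)\,x_0 + t\,x_1$. Under these assumptions the marginal velocity is $v(x,t) = m + \frac{t s^2 - (1-t)}{(1-t)^2 + t^2 s^2}\,(x - t\,m)$, and integrating it maps $x_0$ to $m + s\,x_0$. The function names are hypothetical.

```python
import numpy as np

def gaussian_drift(x, t, m, s):
    """Marginal velocity for N(0, I) -> N(m, s^2 I) under x_t = (1 - t) x0 + t x1."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2       # per-coordinate variance of x_t
    gain = (t * s ** 2 - (1.0 - t)) / var_t     # slope of E[x1 - x0 | x_t]
    return m + gain * (x - t * m)

def integrate(x0, m, s, n_steps=1000):
    """Fixed-step Euler integration of the drift from t = 0 to t = 1."""
    x, h = np.array(x0, dtype=float), 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * gaussian_drift(x, k * h, m, s)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2000, 2))
m, s = np.array([2.0, -1.0]), 0.5
x1 = integrate(x0, m, s)                        # close to m + s * x0
```

Because the exact flow map is known here, the example doubles as a sanity check for an ODE sampler before it is applied to a learned velocity field.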

Limitations include the need for analytic or tractable velocity field targets, sensitivity of some approaches to latent feature dimensionality and regularization, and the risk of sample bias when pre-trained diffusion schedules are mismatched during conversion/alignment. Efficient inversion and invertibility properties for certain flow architectures remain an active area of research.

8. Outlook and Further Directions

Flow matching pre-training continues to evolve, with recent focus on:

Further acceleration of training and inference via map-based or one-step schemes (Boffi et al., 2024).
Extension to more complex conditional settings (physics, multi-modal, semantic control).
Integration with self-supervised and geometric pre-text tasks for broad transfer.
Design of theoretical frameworks and implementation recipes for optimal schedule alignment, variance control, and robust generalization across domains.

References:

"Efficient Flow Matching using Latent Variables" (Samaddar et al., 7 May 2025)
"Multi-Fidelity Flow Matching: Cascaded Refinement of PDE Solutions" (Chen et al., 15 May 2026)
"Explicit Flow Matching: On The Theory of Flow Matching Algorithms with Applications" (Ryzhakov et al., 2024)
"Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment" (Schusterbauer et al., 2 Jun 2025)
"High-Performance Self-Supervised Learning by Joint Training of Flow Matching" (Ukita et al., 17 Dec 2025)
"Generative Pre-training for Speech with Flow Matching" (Liu et al., 2023)
"Flow marching for a generative PDE foundation model" (Chen et al., 23 Sep 2025)
"Rethinking Optical Flow from Geometric Matching Consistent Perspective" (Dong et al., 2023)
"CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow" (Weinzaepfel et al., 2022)
"Flow map matching with stochastic interpolants: a mathematical framework for consistency models" (Boffi et al., 2024)