Rectified Flow Matching
- Rectified Flow Matching (RFM) is a deterministic approach for training generative models via straight-line ODE flows that transport data from a noise source to a target distribution.
- It regresses a neural velocity field onto the constant velocity of the linear path between source and target samples, enabling efficient sampling and high-fidelity generation in applications like image synthesis and text-to-speech.
- Enhanced variants such as Variational RFM and Hierarchical RFM extend its capabilities to handle multimodal data, manifold modeling, and one-step synthesis with strong empirical performance.
Rectified Flow Matching (RFM) is a deterministic framework for training generative models that learns a transport map between source and target distributions via straight-line (rectified) ODE flows. By defining and regressing vector fields that guide these transport paths, RFM delivers notable improvements in sampling efficiency, trajectory linearity, and model simplicity across domains such as image synthesis, text-to-speech, audio, and manifold generative modeling. The approach has achieved state-of-the-art results in a range of modalities and inspired significant theoretical and methodological advances.
1. Mathematical Framework of Rectified Flow Matching
RFM centers on constructing deterministic transport maps from a simple source distribution $p_0$ (e.g., Gaussian noise) to a complex data distribution $p_1$ by parameterizing a linear ODE trajectory

$$\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad t \in [0, 1],$$

with straight-line interpolation

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim p_0, \; x_1 \sim p_1.$$

The optimal velocity field for any pair $(x_0, x_1)$ is simply $x_1 - x_0$; the model regresses the neural velocity field $v_\theta$ to this target via a mean squared error (MSE) loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2.$$

In the conditional case, e.g., text-to-image or speech synthesis, additional conditioning variables $c$ (text embedding, acoustic features, etc.) are incorporated into the field as $v_\theta(x_t, t, c)$. By integrating the learned ODE forward from noise ($t = 0$) to data ($t = 1$) for sampling, or backward for inversion, these models generate diverse, high-fidelity samples in a small (often single-digit) number of ODE steps.
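As a toy illustration of this recipe (not the implementation of any cited system), the following NumPy sketch trains a hand-rolled linear-in-features velocity field on the straight-line regression target and then samples by Euler-integrating the learned ODE. The 1-D two-spike "data" distribution, the feature map, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D problem: source is standard Gaussian noise, "data" is two spikes.
def sample_pair(n):
    x0 = rng.standard_normal(n)            # x_0 ~ p_0 (noise)
    x1 = rng.choice([-2.0, 2.0], size=n)   # x_1 ~ p_1 ("data")
    return x0, x1

# Simple velocity model v_theta(x, t) = w . phi(x, t).
def features(x, t):
    return np.stack([np.ones_like(x), x, t, x * t], axis=1)

def mse(w, x0, x1, t):
    xt = (1 - t) * x0 + t * x1             # straight-line interpolation
    resid = features(xt, t) @ w - (x1 - x0)
    return np.mean(resid ** 2)

w = np.zeros(4)
ex0, ex1 = sample_pair(1024)               # fixed evaluation batch
et = rng.uniform(size=1024)
loss_before = mse(w, ex0, ex1, et)

lr = 0.05
for _ in range(2000):
    x0, x1 = sample_pair(256)
    t = rng.uniform(size=256)
    xt = (1 - t) * x0 + t * x1
    phi = features(xt, t)
    grad = 2 * phi.T @ (phi @ w - (x1 - x0)) / len(t)
    w -= lr * grad                         # SGD on the MSE objective

loss_after = mse(w, ex0, ex1, et)

# Sampling: Euler-integrate dx/dt = v_theta(x, t) from t = 0 to t = 1.
def sample(n, steps=8):
    x = rng.standard_normal(n)
    for k in range(steps):
        t = np.full(n, k / steps)
        x = x + (1.0 / steps) * (features(x, t) @ w)
    return x

samples = sample(1000)
```

A neural network would replace the fixed feature map in practice; the structure of the loss and the few-step Euler sampler carry over unchanged.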
2. Theoretical Properties and Extensions
Deterministic, Straight Trajectories
Unlike diffusion models, which stochastically denoise samples along curved, random trajectories, RFM enforces straight and non-intersecting paths between paired samples, resulting in highly efficient sampling and improved fidelity per step (Guo et al., 2023, Wang et al., 10 Apr 2025).
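To see why straightness buys efficiency per step: along a perfectly rectified path the velocity is constant, so a single Euler step already lands on the exact endpoint. A toy check (values are arbitrary):

```python
import numpy as np

# Ideal rectified case: v(x_t, t) = x1 - x0 is constant along the path.
x0, x1 = np.array([0.3]), np.array([2.7])
v = x1 - x0                 # constant velocity of the straight path
one_step = x0 + 1.0 * v     # a single Euler step over t in [0, 1]
```

Curved trajectories, by contrast, accumulate discretization error at every step, which is why diffusion samplers need many more evaluations.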
Conditional RFM and Optimal Transport
RFM relies on "conditional flow matching": for each sample, the velocity field is regressed towards the vector between noise and data. Mathematical analysis (Hertrich et al., 26 May 2025) shows that RFM's linear interpolation only produces the true optimal transport (OT) paths under restrictive conditions (e.g., when distributions are jointly Gaussian and rectifiable), and that the gradient-of-potential (OT) constraint does not guarantee exact OT in general, especially for disconnected or multimodal supports.
Handling Multi-modal Velocity Fields
Standard RFM regresses to the first moment—averaging all possible ground-truth velocities passing through a given spacetime location $(x_t, t)$. This uni-modal approximation can lead to inefficient, curved flows in multi-modal settings (Guo et al., 13 Feb 2025). To address this, Variational Rectified Flow Matching (V-RFM) introduces a latent variable $z$, allowing the velocity field $v_\theta(x_t, t, z)$ to model the full multimodal structure of the ground-truth velocity distribution at each spacetime location.
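A toy numeric illustration of this first-moment averaging (the numbers and the stand-in latent are made up for exposition): with data modes at -2 and +2, the mean velocity through the origin at $t = 0$ points at neither mode, while a latent-conditioned field can recover each branch.

```python
import numpy as np

# At t = 0, take x_t = x_0 = 0 with data modes at -2 and +2.
# The two possible ground-truth velocities x1 - x0 through this point:
v_candidates = np.array([-2.0, 2.0])

# Standard RFM regresses to the conditional mean of these velocities,
# which points at neither mode:
v_mean = v_candidates.mean()

# V-RFM conditions the field on a latent z that selects a branch of the
# multimodal velocity distribution (z here is a toy stand-in):
def v_latent(z):
    return v_candidates[z]   # z in {0, 1} picks a mode
```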
Hierarchical and Batch-coupled Extensions
Hierarchical Rectified Flow Matching (HRF) introduces ODE hierarchies that allow modeling not only position but also distributions over velocity and higher-order derivatives (Zhang et al., 17 Jul 2025). The introduction of mini-batch OT couplings at each level allows flexible control over velocity-field multimodality and supports more efficient, straight paths, with complexity tunable by batch size.
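The mini-batch OT coupling idea can be sketched as follows; this brute-force permutation search is only viable for toy batch sizes (practical implementations use the Hungarian algorithm or Sinkhorn iterations), and the 1-D Gaussians are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def minibatch_ot_coupling(x0, x1):
    """Pair source/target samples to minimise total squared transport cost.

    Brute force over all permutations of the batch; fine for tiny batches,
    replaced by Hungarian/Sinkhorn solvers in practice.
    """
    n = len(x0)
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        cost = sum((x0[i] - x1[j]) ** 2 for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return np.array(best_perm)

x0 = np.sort(rng.standard_normal(5))          # mini-batch of noise
x1 = np.sort(rng.standard_normal(5) + 3.0)    # mini-batch of data
perm = minibatch_ot_coupling(x0, x1)          # x0[i] paired with x1[perm[i]]
```

In 1-D with squared cost, the optimal coupling of sorted samples is the monotone (identity) pairing, which is exactly the kind of non-crossing assignment that straightens the resulting flows.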
Manifold and Constrained Domain RFM
Riemannian Flow Matching (RFM on general geometries, (Chen et al., 2023)) extends the framework to data on manifolds (e.g., spheres, tori, meshes) by crafting ODEs along geodesic (or spectral) distances with respect to the manifold's metric. Reflected Flow Matching (Xie et al., 26 May 2024) augments RFM to enforce strict domain boundaries (e.g., pixel values confined to a bounded interval), ensuring all generated samples remain within physically meaningful constraints and eliminating boundary violations—critical for image synthesis or scientific data on constrained domains.
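A minimal sketch of the boundary-reflection idea at integration time, assuming a 1-D domain $[0, 1]$; this only illustrates reflected sampling, not the full training procedure of the cited paper.

```python
import numpy as np

def reflect(x):
    """Fold real values back into [0, 1] by mirror reflection at 0 and 1."""
    y = np.mod(x, 2.0)
    return np.where(y > 1.0, 2.0 - y, y)

def euler_reflected(v, x, steps=8):
    """Euler-integrate dx/dt = v(x, t), reflecting after every step so the
    trajectory can never leave the valid domain [0, 1]."""
    for k in range(steps):
        x = reflect(x + (1.0 / steps) * v(x, k / steps))
    return x

# Even a deliberately boundary-violating constant drift stays inside [0, 1]:
out = euler_reflected(lambda x, t: np.full_like(x, 3.0), np.array([0.1, 0.9]))
```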
3. Practical Implementations and Architectural Innovations
A variety of architectural refinements have been proposed:
- SlimSpeech and VoiceFlow: RFM-based TTS systems employ rectified flow decoders, lightweight encoders (e.g., depthwise separable convolutions), and aggressive parameter sharing and distillation to achieve near-state-of-the-art quality in very few ODE evaluations (~1 step), with significant reductions in model size and increased inference speed (Guo et al., 2023, Wang et al., 10 Apr 2025).
- AudioTurbo and Frieren: Efficient text-to-audio and video-to-audio models leverage RFM with distillation from large pre-trained diffusion models, deterministic ODE-guided flows, and conditional transformer-based vector field estimators for temporal alignment and one-step/high-speed synthesis (Zhao et al., 28 May 2025, Wang et al., 1 Jun 2024).
- PromptReverb: Combines RFM with conditional diffusion transformers in the latent domain, conditioned on multimodal input for high-fidelity, text-controllable room impulse response generation (Vosoughi et al., 25 Oct 2025).
- RFM-Editing and FlowSep: RFM is adapted for text-guided audio editing and language-guided source separation, encoding instructions, original/masked features, or multimodal mixtures in the conditional vector field and operating within VAE latent spaces (Gao et al., 17 Sep 2025, Yuan et al., 11 Sep 2024).
4. Sampling Efficiency, Model Quality, and Empirical Benchmarks
RFM-based models consistently outperform diffusion and earlier flow models along key axes of efficiency and fidelity:
- Sampling Steps: RFM, combined with reflow, annealing, and distillation, achieves competitive or superior performance with single-digit sampling steps (1–10), far faster than diffusion-based methods—which typically require 50–200 steps (Wang et al., 10 Apr 2025, Zhao et al., 28 May 2025).
- Quality and Naturalness: On core evaluation metrics such as MOS, FAD, FD (audio), FID (images), and temporal alignment (audio/video), RFM delivers scores equivalent to or surpassing much larger or slower baselines. For example, SlimSpeech (1 step, 5.5M params) achieves FAD 0.693 and MOS 3.71, on par with models 5x larger (Wang et al., 10 Apr 2025); AudioTurbo achieves better text-audio alignment and quality than prior TTA models at an order of magnitude fewer steps (Zhao et al., 28 May 2025).
- Ablations: Both annealing reflow and multi-step distillation are empirically shown to be critical for the transition from multi-step to one-step inference without loss of quality.
| Model | Steps | Params | MOS | FAD | FD | RTF |
|---|---|---|---|---|---|---|
| SlimSpeech | 1 | 5.48M | 3.71 | 0.693 | 0.806 | 0.0139 |
| ReFlow-TTS | 1 | 27.09M | 3.57 | 1.405 | 4.257 | 0.0477 |
| FastSpeech2 | 1 | 28.83M | 3.55 | 2.164 | 6.025 | 0.0759 |
5. Methodological Enhancements and Specializations
Reflow and Annealing
Reflow retrains on teacher-generated samples to further straighten the student’s ODE paths and support strong distillation, while annealing reflow gradually transitions the student from random noisy inputs to the (rectified) teacher outputs, ensuring a smooth, stable transfer even with aggressive capacity reduction (Wang et al., 10 Apr 2025).
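The reflow step can be sketched as a data-generation routine; the analytic teacher field below is a hypothetical stand-in for a trained model, and the step count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def euler_sample(v, x0, steps=32):
    """Plain Euler integration of dx/dt = v(x, t) from t = 0 to t = 1."""
    x = x0.copy()
    for k in range(steps):
        x = x + (1.0 / steps) * v(x, k / steps)
    return x

def reflow_pairs(teacher_v, n, steps=32):
    """Reflow: re-pair each noise sample with the teacher's OWN ODE output
    for that noise, so retraining regresses along already-straightened
    paths instead of arbitrary noise/data pairings."""
    z = rng.standard_normal(n)
    x1 = euler_sample(teacher_v, z, steps)
    return z, x1    # (source, target) pairs for the next training round

# Hypothetical teacher field: drift toward 2.0.
z, x1 = reflow_pairs(lambda x, t: 2.0 - x, 64)
```

Annealing reflow would interpolate between random pairings and these teacher-generated pairs over the course of training rather than switching abruptly.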
Distillation and One-Step Synthesis
Flow-guided distillation enables small student models to absorb the (complex, straight) ODE transport learned by a large teacher, permitting direct one-step mapping, reinforced by two-step regularization to prevent overfitting (Wang et al., 10 Apr 2025, Wang et al., 1 Jun 2024).
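A toy sketch of one-step distillation under strong simplifying assumptions: the "teacher" is a hypothetical analytic 64-step flow, and the one-step "student" is an affine map fitted by least squares. The fit is near-exact here only because this toy teacher flow is linear in its input; real distillation uses neural students and is approximate.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical teacher: a 64-step Euler flow whose drift pulls noise
# toward 2.0 (stand-in for a large pre-trained rectified-flow model).
def teacher_ode(z, steps=64):
    x = z.copy()
    for k in range(steps):
        x = x + (1.0 / steps) * (2.0 - x)
    return x

# Distillation data: noise inputs and the teacher's multi-step outputs.
z = rng.standard_normal(512)
x1 = teacher_ode(z)

# One-step student: an affine map fitted to reproduce the teacher's full
# trajectory endpoint in a single evaluation.
A = np.stack([z, np.ones_like(z)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, x1, rcond=None)

one_step = a * z + b    # one function call replaces 64 ODE steps
```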
Preference Optimization and Control
For motion and controlled image generation (e.g., MotionFLUX, FlowChef), variants of RFM employ explicit preference or classifier-guided losses, and a theoretical analysis of vector field dynamics shows that gradient-free, deterministic steering suffices for accurate and efficient control, in contrast to diffusion models that require expensive iterative backpropagation through the ODE (Gao et al., 27 Aug 2025, Patel et al., 27 Nov 2024).
6. Limitations, Controversies, and Relation to Optimal Transport
The equivalence of RFM and optimal transport—specifically, whether minimizing the RFM loss with a gradient/potential vector field constraint yields the exact OT map—is shown to hold only under stringent assumptions (joint normality/rectifiability, connected supports, etc.) (Hertrich et al., 26 May 2025). Explicit counterexamples reveal that loss convergence to zero and gradient constraint are insufficient for general OT recovery; thus, use of RFM for true OT computation in generative models requires caution and verification of underlying hypotheses.
RFM’s deterministic straightness, while lending efficiency, can under-express multimodal velocity fields unless enhanced via variational (V-RFM), hierarchical, or mini-batch coupled learning (Guo et al., 13 Feb 2025, Zhang et al., 17 Jul 2025). For certain domains (e.g., highly multimodal image classes, or scientific data with manifold structure), rectified geodesic flows or explicit multimodal modeling are critical for performance.
7. Impact and Future Directions
Rectified Flow Matching has enabled a suite of generative frameworks achieving strong performance in terms of speed, memory efficiency, sample quality, and domain flexibility, especially on edge and real-time applications. The approach is particularly amenable to architectures with limited computational resources and has demonstrated value in text-to-speech, audio, image, video, and manifold learning tasks.
Ongoing research focuses on extending RFM to high-dimensional, multimodal, and highly structured domains via variational modeling, optimal couplings, and efficient manifold- or constraint-respecting formulations. Unifying RFM with explicit optimal transport theory and addressing the limitations of uni-modal vector field regression remain open and active topics.
References: For specific mathematical formulations and empirical findings, see (Guo et al., 2023, Wang et al., 10 Apr 2025, Guo et al., 13 Feb 2025, Hertrich et al., 26 May 2025, Zhang et al., 17 Jul 2025, Chen et al., 2023, Xie et al., 26 May 2024, Gao et al., 17 Sep 2025, Yuan et al., 11 Sep 2024, Wang et al., 1 Jun 2024, Patel et al., 27 Nov 2024, Wang et al., 18 Mar 2025, Gao et al., 27 Aug 2025, Vosoughi et al., 25 Oct 2025), and (Zhao et al., 28 May 2025).