Rectified Flow Matching Diffusion
- The paper introduces rectified flow matching, which reformulates diffusion sampling as a deterministic ODE, enabling near-straight trajectories from noise to data.
- It employs a two-stage training process to rectify trajectory deviations, drastically reducing sampling steps while maintaining fidelity in tasks such as TTS and image synthesis.
- Empirical evaluations demonstrate competitive performance with far fewer diffusion steps and robust plug-and-play transferability across multiple generative applications.
Rectified Flow Matching–Based Diffusion Frameworks constitute a class of generative modeling techniques that fundamentally alter the sampling and training paradigms used in classic diffusion models. Rather than relying on stochastic reverse processes or high-step iterative denoising, these frameworks leverage the principle of “rectified flow” to deterministically map noise to data distributions along (nearly) straight-line trajectories in latent or data space. This approach enables highly efficient sampling, often requiring orders of magnitude fewer steps, while offering competitive or superior fidelity across several generative tasks, including text-to-speech, audio editing, image synthesis, and physics-constrained generation.
1. Mathematical Foundation and ODE Formulation
At the core of rectified flow matching is the reformulation of sampling as solving a deterministic ordinary differential equation (ODE) along an optimized vector field. Given a source (noise) distribution $\pi_0$ and a target (data) distribution $\pi_1$, rectified flow defines a linear probability path between random samples $x_0 \sim \pi_0$ and $x_1 \sim \pi_1$:

$$x_t = (1 - t)\,x_0 + t\,x_1$$

for $t \in [0, 1]$. The associated conditional density is Gaussian,

$$p_t(x_t \mid x_0, x_1) = \mathcal{N}\big(x_t;\, (1 - t)\,x_0 + t\,x_1,\, \sigma^2 I\big),$$

where $\sigma$ is a small noise constant. The key property is that the velocity field

$$v(x_t, t) = x_1 - x_0$$

is constant along this path. Learning this velocity field is cast as a regression problem using a neural network $v_\theta(x_t, t, c)$, where $c$ refers to conditioning variables (such as text or speaker identity in TTS). The training objective is to minimize

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\left[\big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2\right].$$
This “straightening” of the generative trajectory stands in contrast to the highly curved trajectories produced by conventional diffusion models.
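As a concrete illustration, here is a minimal PyTorch sketch of this objective; the network `v_theta`, the conditioning `c`, and the noise constant `sigma` are placeholder assumptions rather than any specific paper's implementation:

```python
import torch

def flow_matching_loss(v_theta, x0, x1, c, sigma=1e-4):
    """Rectified flow matching loss for one batch.

    x0: noise samples, x1: data samples, c: conditioning (e.g., text).
    The regression target is the constant path velocity x1 - x0.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)              # t ~ U[0, 1]
    t_ = t.view(b, *([1] * (x0.dim() - 1)))          # broadcast t over feature dims
    xt = (1.0 - t_) * x0 + t_ * x1                   # point on the linear path
    xt = xt + sigma * torch.randn_like(xt)           # small Gaussian perturbation
    target = x1 - x0                                 # constant velocity along the path
    return torch.mean((v_theta(xt, t, c) - target) ** 2)
```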
After training, new data samples are generated by numerically integrating the ODE from $t = 0$ to $t = 1$, starting with $x_0 \sim \mathcal{N}(0, I)$:

$$\frac{dx_t}{dt} = v_\theta(x_t, t, c).$$
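Sampling then amounts to a handful of forward-Euler steps; a sketch under the same placeholder interface (any off-the-shelf ODE solver could substitute for the Euler loop):

```python
import torch

@torch.no_grad()
def sample_euler(v_theta, x0, c, n_steps=10):
    """Integrate dx/dt = v_theta(x, t, c) from t = 0 to t = 1 with forward Euler."""
    x, dt = x0, 1.0 / n_steps                        # x0 ~ N(0, I)
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * v_theta(x, t, c)                # one Euler step along the flow
    return x
```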
2. Rectification and Two-Stage Training Schemes
Despite the principled straight-line trajectory of rectified flow, initial network parameterizations or high-dimensional learning challenges can induce curvature in the sampled trajectory. To “rectify” this, a two-stage procedure is used:
- Stage 1: Initial training on independently sampled $(x_0, x_1)$ pairs.
- Stage 2 (Rectification / Reflow): The trained model generates new $(x_0, \hat{x}_1)$ pairs by forward-integrating the ODE. The model is then retrained on these self-generated endpoint pairs, yielding a trajectory that closely approximates a linear interpolation between noise and data in the latent space.
This approach, termed “rectified flow,” forces the model to learn nearly straight transport paths, reducing the error introduced by ODE discretization and enabling significant step-count reduction during inference.
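A sketch of the reflow stage under the same assumed interface: the stage-1 model couples each noise sample with its own ODE endpoint, and these matched pairs replace the independent couplings in a second round of training.

```python
import torch

@torch.no_grad()
def make_reflow_pairs(v_theta, noise_batch, c, n_steps=100):
    """Stage-2 data: pair each noise sample with the model's own ODE endpoint.

    Reuses sample_euler from the sketch above; a fine step count is used here
    so the generated endpoints are accurate.
    """
    x1_hat = sample_euler(v_theta, noise_batch, c, n_steps=n_steps)
    return noise_batch, x1_hat   # deterministically matched (x0, x1) pairs

# Stage 2 then minimizes flow_matching_loss on these (noise, endpoint) pairs
# instead of independently drawn (x0, x1), which straightens the trajectory.
```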
3. Efficiency, Quality, and Empirical Evaluation
Rectified flow matching dramatically reduces sampling steps in generative modeling:
| Method | Steps Required | Quality Degradation at 2 Steps | Subjective/Objective Performance |
|---|---|---|---|
| GradTTS | 100+ | Heavy degradation | Lower MOS; unstable MOSNet & MCD |
| VoiceFlow (RFM) | 2–10 | Minimal degradation | Higher MOS; stable MOSNet & MCD |
In ablation studies (VoiceFlow), removing the rectified flow stage ("–ReFlow") caused significant quality drops (−0.78 and −1.21 CMOS on LJSpeech and LibriTTS, respectively), confirming that rectification yields straighter, more sample-efficient ODE trajectories. MOSNet and Mel-Cepstral Distortion (MCD) scores further confirm that quality remains robust at low sampling-step counts.
4. Straightness, First-Order Consistency, and Theoretical Generalizations
Subsequent analysis (see Wang et al., 9 Oct 2024) clarifies that geometric straightness, while a useful heuristic, is not essential. The primary requirement is that the trajectory be first-order consistent with the local ODE dynamics (e.g., the curved probability-flow path $x_t = \sqrt{\bar{\alpha}_t}\, x_1 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ for DDPMs), with the network consistently predicting the correct update direction along the ODE path. The central insight is that using deterministically matched noise–sample pairs (as produced via a pretrained diffusion model) suffices to make the ODE path first-order accurate, even if mildly curved, as in DDPM or Sub-VP models.
Rectified Diffusion generalizes rectified flow matching by forgoing explicit velocity prediction: it retrains a pretrained model on these matched pairs without converting it to $v$-prediction or a flow-matching parameterization. This yields training and inference procedures that are both simpler and more efficient while delivering competitive or superior performance.
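A sketch of this idea with hypothetical names (`eps_theta` for the noise-prediction network, `alpha_bar` for the cumulative schedule): the standard DDPM loss is kept, but each sample is noised with its deterministically matched partner rather than fresh noise.

```python
import torch

def rectified_diffusion_loss(eps_theta, x1_hat, eps_matched, alpha_bar, c):
    """Retrain an epsilon-prediction model on matched (noise, sample) pairs.

    x1_hat: samples the pretrained model generated from eps_matched, so the
    pair lies on one probability-flow ODE trajectory. alpha_bar: shape (T,).
    """
    b = x1_hat.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=x1_hat.device)
    a = alpha_bar[t].view(b, *([1] * (x1_hat.dim() - 1)))
    # Standard DDPM forward process, but with the matched noise, not a fresh draw.
    xt = a.sqrt() * x1_hat + (1.0 - a).sqrt() * eps_matched
    return torch.mean((eps_theta(xt, t, c) - eps_matched) ** 2)
```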
5. Transferability, Integration, and Modular Acceleration
Rectified flow matching methods (including PeRFlow) emphasize modularity and transferability. By structuring the velocity update as plug-in weight differences ($\Delta W$), the acceleration can be universally applied across workflows built over the same base model, including ControlNet, IP-Adapter, or AnimateDiff, and extended to multiview or transformer-based 3D pipelines. This plug-and-play acceleration enables lossless transfer of the few-step, high-efficiency sampling benefits without retraining downstream workflows.
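A minimal sketch of the plug-in pattern, assuming state dicts with matching parameter names (the delta tensors stand in for a released accelerator such as PeRFlow's $\Delta W$):

```python
def apply_delta_weights(base_state, delta_state):
    """Merge accelerator weight differences into a base model's parameters.

    Any workflow built on the same base model (ControlNet, IP-Adapter, ...)
    can load the merged weights without retraining its own components.
    """
    return {name: w + delta_state[name] if name in delta_state else w
            for name, w in base_state.items()}
```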
Orthogonality to standard diffusion acceleration and distillation techniques enables seamless integration with common frameworks such as DDIM, DPM-Solver, and LoRA.
6. Applications, Limitations, and Future Directions
Rectified flow matching underpins state-of-the-art efficiency in domains including text-to-speech (Guo et al., 2023), personalized image generation (Sun et al., 23 May 2024), high-resolution image synthesis (Schusterbauer et al., 2023), video-to-audio alignment (Wang et al., 1 Jun 2024), and audio/text editing (Gao et al., 17 Sep 2025). Plug-and-play priors based on rectified flow achieve more efficient loss functions for 3D optimization and image editing (Yang et al., 5 Jun 2024).
Notable limitations establish the practical boundaries of current techniques:
- The ideal straight-path assumption is only approximately met in piecewise or high-dimensional settings, and small deviations may occur.
- In domains with strong distributional curvature or complex optimal transport geometry, phased or progressive rectification (as in ProReflow (Ke et al., 5 Mar 2025)) or local linearization within segmentation windows (PeRFlow) is required.
- Guidance with off-the-shelf discriminators (RectifID) relies on ideal flow assumptions and may struggle with highly irregular subjects.
Ongoing research explores extensions to infinite-dimensional and functional spaces (Zhang et al., 12 Sep 2025), deeper integration with physics constraints (Baldan et al., 10 Jun 2025), and improved diversity via discretized/momentum flows (Ma et al., 10 Jun 2025).
7. Summary Table: Key Features Across Selected Works
| Method | Trajectory Structure | Training Stages | Efficiency | Transferability |
|---|---|---|---|---|
| VoiceFlow | Linear ODE | Reflow (2-stage) | 2–10 steps, high MOS | Domain-specific |
| PeRFlow | Piecewise linear | Windowed reflow | 4 steps, low FID | Universal ($\Delta W$) |
| Rectified Diff. | General ODE (curved) | Direct matching, no conversion | 1–4 steps, low training cost | Model-agnostic |
| ProReflow | Progressive windowed | Multiphase, alignment | 4 steps, curriculum | Backbone-agnostic |
| RectifID | Linear or piecewise | Fixed-point iteration | Training-free guidance | With discriminators |
The rectified flow matching–based diffusion paradigm marks a decisive shift in the design of high-quality, efficient generative models. By learning velocity fields that drive nearly optimal transport between noise and data, and iteratively refining these via post-training rectification or phased reflow, these frameworks set benchmarks in sample efficiency, fidelity, and modularity—while offering extensibility to increasingly diverse domains and theoretical settings.