
Conditional Rectified Flow (CRF)

Updated 10 November 2025
  • Conditional Rectified Flow (CRF) is a deterministic generative modeling framework that transforms complex conditional distributions into nearly linear ODE flows using learnable, time-dependent velocity fields.
  • Its methodology replaces stochastic diffusion with straightened trajectories via a flow-matching loss and geometry-aware predictor–corrector schemes, ensuring computational efficiency and stability.
  • CRF is applied across domains such as text-to-image synthesis, speech generation, and biomedical imaging, offering significant speedups and enhanced sample quality compared to traditional diffusion models.

Conditional Rectified Flow (CRF) is a deterministic generative modeling framework that enables efficient, high-fidelity modeling of complex conditional distributions by transforming the stochastic transport characteristic of diffusion and score-based models into nearly linear, straightened ordinary differential equation (ODE) flows. Conditioned on auxiliary variables (e.g., text, images, gene expression, PDE solutions), CRF establishes deterministic ODEs or vector fields whose sampled trajectories connect structured priors to data manifolds in a stable and computationally efficient manner. This methodology has been applied to diverse domains, including text-to-image synthesis, speech generation, biomedical imaging, illumination enhancement, and fluid dynamics, offering significant speedups and robustness compared to conventional diffusion-based approaches.

1. Mathematical Foundation of Conditional Rectified Flow

CRF generalizes probabilistic transport via ODEs parameterized by learnable, time-dependent velocity fields $v_\theta(x, t; c)$, where $x$ is the sample, $t \in [0, 1]$ denotes the synthetic time index, and $c$ is the conditioning signal. The essential formulation is:

$$\frac{dx_t}{dt} = v_\theta(x_t, t; c), \quad x_0 \sim p_0, \quad x_1 \sim p_1,$$

where $p_0$ is a tractable prior (e.g., standard normal) and $p_1$ is the data distribution or conditional target.

The velocity field is trained to approximate the “straight-line” barycentric or displacement velocity connecting source and target (e.g., $x_1 - x_0$ for linear interpolation). The flow-matching loss minimizes the mean squared deviation between $v_\theta$ and the ground-truth velocity:

$$\mathcal{L}_\text{CFM}(\theta) = \mathbb{E}_{t, x_0, x_1, c}\; \bigl\| v_\theta(x_t, t; c) - (x_1 - x_0) \bigr\|^2,$$

where $x_t = (1-t)x_0 + t x_1$. Conditioning $c$ can be drawn from any auxiliary source (e.g., text embedding, image, prior state), and is injected via concatenation, FiLM, or attention-based encoders depending on the application (Armegioiu et al., 3 Jun 2025, Wang et al., 31 Oct 2025, Wei et al., 4 Nov 2025, Guo et al., 2023).

In advanced variants, rectification or time-warping reparametrizes the flow so that the velocity norm is roughly constant in synthetic time, further straightening trajectories and reducing discretization error.
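The objective above admits a compact implementation. Below is a minimal PyTorch sketch of the conditional flow-matching loss; `velocity_net(x_t, t, c)` is a hypothetical model (any of the architectures in Section 3 would fit), so this is illustrative rather than a reference implementation.

```python
import torch

def cfm_loss(velocity_net, x1, c):
    """Conditional flow-matching loss E ||v_theta(x_t, t; c) - (x_1 - x_0)||^2."""
    x0 = torch.randn_like(x1)                      # draw from the prior p_0 = N(0, I)
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform synthetic time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over non-batch dims
    xt = (1 - t_) * x0 + t_ * x1                   # linear interpolant x_t
    target = x1 - x0                               # straight-line displacement velocity
    pred = velocity_net(xt, t, c)                  # v_theta(x_t, t; c)
    return ((pred - target) ** 2).mean()
```

Because the regression target is the constant displacement $x_1 - x_0$ along each interpolant, no noise schedule or score estimation is required.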

2. Predictor–Corrector Inference and Geometry-aware Conditioning

When CRF is used for conditional generation with strong guidance or classifier-free modulation, naïve application may lead to off-manifold drift. For example, in text-to-image models, classifier-free guidance (CFG) can push samples away from the learned data support, manifesting as artifacts or semantic alignment failures. To address this, Rectified-CFG++ introduces an adaptive geometry-aware predictor–corrector scheme:

  • Predictor: Propagate the sample a half-step using the conditional velocity field.
  • Corrector: At the intermediate position, interpolate between conditional and unconditional velocities using a weight schedule $\alpha(t)$.
  • Update: Advance with the corrected velocity, maintaining the trajectory's proximity to the tangent space of the data manifold.

Formally, at time $t$ and step size $\Delta t$:

$$\tilde{x}_{t - \frac{\Delta t}{2}} = x_t + \frac{\Delta t}{2}\, v_t^c(x_t),$$

$$\hat{v}_t = v_t^c(x_t) + \alpha(t) \bigl( v^c_{t - \frac{\Delta t}{2}}(\tilde{x}_{t - \frac{\Delta t}{2}}) - v^u_{t - \frac{\Delta t}{2}}(\tilde{x}_{t - \frac{\Delta t}{2}}) \bigr),$$

$$x_{t-\Delta t} = x_t + \Delta t\, \hat{v}_t.$$

Here $v^c$ and $v^u$ denote the conditional and unconditional velocities, respectively; $\alpha(t)$ (often of the form $\lambda_{\max}(1-t)^\gamma$) calibrates the guidance strength. This structure provably bounds per-step manifold deviation under mild regularity assumptions and ensures marginal consistency: as $\Delta t \to 0$ and $\alpha(t) \to 0$, the process reduces to pure conditional flow (Saini et al., 9 Oct 2025).
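The three updates above translate directly into a single sampling step. The following is a hedged sketch: `v_cond` and `v_uncond` stand for the network evaluated with and without the conditioning signal, and the schedule constants `lam_max` and `gamma` are illustrative assumptions, not values prescribed by the method.

```python
import torch

def alpha(t: float, lam_max: float = 0.75, gamma: float = 1.0) -> float:
    """Guidance weight schedule alpha(t) = lam_max * (1 - t)^gamma (assumed form)."""
    return lam_max * (1.0 - t) ** gamma

def pc_step(x: torch.Tensor, t: float, dt: float, v_cond, v_uncond) -> torch.Tensor:
    vc = v_cond(x, t)                  # conditional velocity at the current point
    t_mid = t - dt / 2
    x_mid = x + (dt / 2) * vc          # predictor: half-step with conditional velocity
    v_hat = vc + alpha(t) * (
        v_cond(x_mid, t_mid) - v_uncond(x_mid, t_mid)  # corrector: guided interpolation
    )
    return x + dt * v_hat              # update toward t - dt with corrected velocity
```

As `alpha` vanishes, the step collapses to plain conditional integration, matching the marginal-consistency limit noted above.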

3. Model Architectures and Training Strategies

Network Parameterizations

CRF models employ U-Net backbones (with residual and attention blocks) or analogous architectures tailored to the data modality, with the conditioning variable integrated via concatenation, FiLM layers, or cross-attention. Notable architectural details include:

  • SR3-style U-Nets for imaging tasks, where $t$ and difference maps (e.g., $L_n - L_l$ in illumination models) are injected into each block (Wei et al., 4 Nov 2025).
  • Attention-based encoders for complex conditioning, such as RNA embedding in gene-to-image translation, with low-rank gene relation modules and global gene attention (Wang et al., 31 Oct 2025).
  • Feature-wise temporal modulation using learnable positional embeddings, especially in temporal or scale-ordered domains like PDEs or speech (Armegioiu et al., 3 Jun 2025, Guo et al., 2023).
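To make the conditioning pathways concrete, here is a minimal sketch of a FiLM layer of the kind referenced above: the condition embedding predicts a per-channel scale and shift for intermediate feature maps. The module name, dimensions, and the `(1 + gamma)` stabilization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * num_channels)  # predicts (gamma, beta)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: feature map (B, C, H, W); c: condition embedding (B, cond_dim)
        gamma, beta = self.proj(c).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]    # broadcast over spatial dims
        beta = beta[:, :, None, None]
        return (1 + gamma) * h + beta      # feature-wise affine modulation
```

Concatenation and cross-attention follow the same pattern: the condition enters every block rather than only the input layer.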

Training Objectives

The central loss is the flow-matching loss described above, supplemented by:

  • Consistency/shortcut regularizers to promote stability in one-step or multi-step integration by penalizing inconsistency between velocities across infinitesimal $t$-steps (Wei et al., 4 Nov 2025).
  • Auxiliary losses appropriate to the target: duration prediction (speech), L1 regularization (gene sparsity), SSIM/content losses (imaging), and spatial graph losses (cell morphology).

For rectified flows, a two-stage training process is common: initial training on data-based endpoints, followed by retraining/recalibration on the network's own generated endpoint pairs to further straighten the flow (Guo et al., 2023).
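A hedged sketch of that second stage: the current model is integrated forward to couple each prior draw with its own generated endpoint, and the flow-matching loss is then retrained on these fixed pairs. The function names and the fixed-step integrator are assumptions for illustration.

```python
import torch

@torch.no_grad()
def generate_reflow_pairs(velocity_net, c, shape, n_steps=100):
    """Couple prior samples x0 with model-generated endpoints x1."""
    x0 = torch.randn(shape)               # draws from the prior p_0
    x, dt = x0.clone(), 1.0 / n_steps
    for i in range(n_steps):              # Euler-integrate dx/dt = v_theta from t=0 to 1
        t = torch.full((shape[0],), i * dt)
        x = x + dt * velocity_net(x, t, c)
    return x0, x                          # fixed (x0, x1) pairs for stage-two training
```

Retraining the Section 1 loss on these couplings, holding each $x_0$ fixed to its generated $x_1$ rather than resampling, is what straightens the learned trajectories.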

4. Inference, Integration, and Computational Considerations

Inference proceeds by numerically integrating the trained ODE dxtdt=vθ(xt,t;c)\frac{dx_t}{dt} = v_\theta(x_t, t; c):

  • Solvers: Fixed-step (Euler, Runge–Kutta) or adaptive high-order solvers (Dormand–Prince 5th-order, DPM-Solver) are employed. CRF's nearly linear paths allow accurate generation with very few steps, typically $N = 2$ to $20$ depending on the domain and required fidelity (Armegioiu et al., 3 Jun 2025, Guo et al., 2023, Saini et al., 9 Oct 2025).
  • Complexity: Predictor–corrector schemes (e.g., Rectified-CFG++) typically require two network (velocity) evaluations per inference step versus one for standard ODE integration, but this is offset by requiring many fewer steps—permitting significant wallclock speedups and lower overall FLOPs.
  • Stability: The approach is robust to larger step sizes due to trajectory straightness and regularization. Experiments show that unstable behavior or catastrophic drift, present in naive extrapolation or over-guided CFG, is avoided through bounded manifold deviation per integration step (Saini et al., 9 Oct 2025).
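To illustrate the fixed-step solver family named in the list above, here is a minimal sketch of sampling with Heun's method (a second-order Runge–Kutta scheme); the `velocity_net(x, t, c)` interface and the step budget are assumptions carried over from the earlier sketches.

```python
import torch

@torch.no_grad()
def sample_heun(velocity_net, c, shape, n_steps=8):
    x = torch.randn(shape)                 # draw x0 from the prior p_0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t0 = torch.full((shape[0],), i * dt)
        t1 = torch.full((shape[0],), (i + 1) * dt)
        v0 = velocity_net(x, t0, c)
        x_pred = x + dt * v0               # Euler predictor
        v1 = velocity_net(x_pred, t1, c)
        x = x + dt * 0.5 * (v0 + v1)       # Heun (2nd-order) corrector
    return x                               # approximate draw from p_1(. | c)
```

With nearly straight trajectories, even the plain Euler variant (dropping the corrector) is often sufficient; higher-order or adaptive updates trade extra evaluations per step for accuracy without changing the interface.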

5. Domain-specific Applications

The CRF methodology has been extended to multiple domains:

| Domain | Conditioning | Notable System/Method | Benchmark Gains |
|---|---|---|---|
| Text-to-Image Synthesis | Text prompt | Rectified-CFG++ | FID/CLIP gains on MS-COCO, LAION-Aesthetic, T2I-CompBench; artifact and text-alignment improvements (Saini et al., 9 Oct 2025) |
| Text-to-Speech | Phone-level text | VoiceFlow | MOS +0.82 ($N=2$), fewer steps vs. diffusion; MCD, MOSNet gains (Guo et al., 2023) |
| Low-light Enhancement | Image pairs | IllumFlow | PSNR/SSIM/LPIPS gains on LOL v1/v2, MEF; fast, exposure-adjustable enhancement (Wei et al., 4 Nov 2025) |
| Fluid Modeling | Low-res/noisy states | ReFlow | Mean/std/Wasserstein errors match or beat 128-step diffusion with only 8 steps (Armegioiu et al., 3 Jun 2025) |
| Gene-to-Image Translation | RNA expression | GeneFlow | FID 20.7 vs. 171.1 (diffusion); SSIM, FeatureDist, spatial/biological metrics improved (Wang et al., 31 Oct 2025) |

Domain-specific adaptations include decomposing input into physically- or semantically-meaningful components (as in Retinex decomposition for LLIE) and injecting sophisticated encoding pipelines for high-dimensional conditional signals (as in GeneFlow).

6. Performance, Limitations, and Theoretical Guarantees

Quantitative evaluations across domains consistently demonstrate that CRF yields faster, more stable, and often higher-fidelity samples compared to classical diffusion counterparts, especially as the inference step count is decreased. Notably:

  • Rectified-CFG++ achieves FID and CLIP-Score improvements on text-to-image tasks; step budgets as low as 10 (vs. 28+ for baseline CFG) suffice for high fidelity (Saini et al., 9 Oct 2025).
  • VoiceFlow remains intelligible and natural even at $N=2$ steps, with ablation showing a 0.78 to 1.21 MOS drop if rectification is omitted (Guo et al., 2023).
  • IllumFlow achieves continuous, exposure-adjustable enhancement at sub-0.1s runtimes, outperforming diffusion and Retinex methods on all tested benchmarks (Wei et al., 4 Nov 2025).
  • In fluid modeling, ReFlow attains up to a 22× inference speedup for comparable error to 128-step diffusion models (Armegioiu et al., 3 Jun 2025).
  • GeneFlow achieves 3–6× lower FID values relative to diffusion, while maintaining morphological and biological structure (Wang et al., 31 Oct 2025).

Theoretical analyses guarantee that, under mild Lipschitz and manifold alignment conditions, CRF steps remain within a bounded “tubular neighborhood” of the target data manifold, and that marginal consistency with the true conditional law is preserved as the integration discretization vanishes (Saini et al., 9 Oct 2025).

Observed failure cases trace to model capacity limitations (e.g., missed secondary elements in crowded scenes or under-specified conditional targets) rather than to the dynamics of the flow itself.

7. Future Directions, Open Challenges, and Practical Recommendations

Active research trajectories include:

  • Extension to SDEs and diffusion-based stochastic models, as most current CRF frameworks target deterministic ODE flows.
  • Video and sequential data generation leveraging multi-time-scale and hierarchical rectified flows.
  • Preference-based and learned guidance weighting integrations.
  • End-to-end distributional robustness bounds (e.g., closed-form KL divergence control over the entire sampling path), which remain theoretically open.
  • Practical deployment recommendations: guidance schedule tuning (e.g., $\lambda_{\max} \in [0.5, 1.0]$), step size adjustment, and CRF module separation for multi-stage enhancement workflows.

A plausible implication is that CRF provides an effective, drop-in, training-free upgrade for any flow-matching generator, with the capacity for deterministic, invertible, and artifact-free mapping between disparate data modalities under conditional requirements. The hybridization of deterministic ODE transport with flexible, learnable velocity fields continues to enable advances in generative modeling speed, stability, and cross-modal translation.
