Variational Rectified Flow Matching
- The paper introduces VRFM, which explicitly models multi-modal velocity fields using latent variables to improve generative sample quality over classic methods.
- VRFM employs a variational ELBO combining reconstruction loss and KL regularization to optimize neural ODE-based transport between distributions.
- Empirical results on datasets like MNIST, CIFAR-10, and ImageNet demonstrate VRFM’s superior FID, log-likelihood, and sampling efficiency.
Variational Rectified Flow Matching (VRFM) is a generative modeling framework that extends rectified flow matching by explicitly accounting for the inherent multi-modality of velocity vector-fields encountered when transporting samples between probability distributions. Unlike classic rectified flow matching, which collapses diverse optimal transport directions into uni-modal field estimates through mean-squared error regression, VRFM introduces a latent variable to parameterize and sample from a mixture of plausible flows at each point along a distribution-matching trajectory. This approach more faithfully captures the structure of complex, multi-modal data distributions and provides improvements in both sample quality and sampling efficiency across a broad range of synthetic and real-world domains.
1. Theoretical Foundations
Rectified flow matching provides a continuous interpolation mechanism between a source distribution $p_0$ and a target data distribution $p_1$. For any coupled pair $(x_0, x_1)$ with $x_0 \sim p_0$ and $x_1 \sim p_1$, points along the linear interpolation

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1],$$

are moved through the vector field

$$v_t = \frac{dx_t}{dt} = x_1 - x_0.$$

A neural velocity field $v_\theta(x_t, t)$ parameterizes this transport, defining the generative process through the ODE

$$\frac{dx_t}{dt} = v_\theta(x_t, t),$$

with the evolving density $\rho_t$ governed by the transport (continuity) PDE

$$\partial_t \rho_t(x) + \nabla \cdot \big(\rho_t(x)\, v_\theta(x, t)\big) = 0.$$

In classic flow matching, when two different coupled pairs $(x_0, x_1)$ and $(x_0', x_1')$ are linearly interpolated to the same location $x_t$, the set of valid velocities at that location is multi-modal. However, the mean-squared error loss enforces a regression to the conditional mean, erasing this structure and biasing the learned field.

VRFM augments this structure by associating a latent variable $z$ with each instance. Given $z$, the velocity field is modeled conditionally as $v_\theta(x_t, t, z)$, producing a Gaussian mixture marginal over velocities at each $(x_t, t)$.
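To make the mean-collapse effect concrete, the following minimal NumPy sketch (illustrative values only, not from the paper) constructs two couplings whose interpolations pass through the same location at $t = 0.5$; a deterministic field trained with MSE can only predict their average, which is not a valid velocity for either pair:

```python
import numpy as np

# Two (x0, x1) couplings whose linear interpolations cross the same x_t at t = 0.5.
x0_a, x1_a = np.array([-1.0]), np.array([+1.0])   # velocity v_a = +2
x0_b, x1_b = np.array([+1.0]), np.array([-1.0])   # velocity v_b = -2

t = 0.5
xt_a = (1 - t) * x0_a + t * x1_a                  # = [0.0]
xt_b = (1 - t) * x0_b + t * x1_b                  # = [0.0]  (same location)

v_a, v_b = x1_a - x0_a, x1_b - x0_b

# A deterministic field v(x_t, t) fit with MSE must predict the conditional mean:
v_mse = 0.5 * (v_a + v_b)                         # = [0.0], neither valid velocity
print(xt_a, xt_b, v_a, v_b, v_mse)
```

A latent-conditioned field $v_\theta(x_t, t, z)$ can instead assign each mode of the velocity distribution to a different region of latent space.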
2. Variational Formulation and Training Objective
The generative process aims to maximize the marginal likelihood of the "ground-truth" velocity $v = x_1 - x_0$ for a given $(x_t, t)$:

$$p_\theta(v \mid x_t, t) = \int p_\theta(v \mid x_t, t, z)\, p(z)\, dz.$$

Introducing an encoder $q_\phi(z \mid x_0, x_1, x_t, t)$, training proceeds by maximizing the evidence lower bound (ELBO):

$$\log p_\theta(v \mid x_t, t) \;\geq\; \mathbb{E}_{q_\phi}\!\big[\log p_\theta(v \mid x_t, t, z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x_0, x_1, x_t, t) \,\|\, p(z)\big).$$

Given the Gaussian form of $p_\theta(v \mid x_t, t, z)$, the negative ELBO reduces (up to a constant) to a reconstruction (MSE) term plus a KL regularizer:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}\big[\,\|v_\theta(x_t, t, z) - (x_1 - x_0)\|^2\,\big] + \lambda\, \mathrm{KL}\big(q_\phi \,\|\, \mathcal{N}(0, I)\big),$$

where $\lambda$ modulates the regularization trade-off.
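For the diagonal-Gaussian posterior implied by the mean and log-variance heads described below, with a standard-normal prior, the KL term has the standard closed form

$$\mathrm{KL}\big(\mathcal{N}(\mu_\phi, \operatorname{diag}(\sigma_\phi^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{i=1}^{d}\left(\mu_{\phi,i}^2 + \sigma_{\phi,i}^2 - \log \sigma_{\phi,i}^2 - 1\right),$$

which is the quantity computed as `l_kl` in the training algorithm of Section 5.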
3. Model Architecture and Sampling Procedure
Both the velocity network $v_\theta$ and the posterior $q_\phi$ are implemented using neural architectures suitable for the data domain:
- Velocity network ($v_\theta$):
  - Input: $(x_t, t, z)$. Time $t$ is encoded (e.g., sinusoidal features plus a projection), and $z$ is processed via a small MLP.
  - Backbone: MLP for 1D/2D data, convolutional ResNet for MNIST, UNet/Transformer for higher-dimensional domains (e.g., CIFAR-10, ImageNet).
  - Output: mean velocity vector in the data domain.
- Posterior network ($q_\phi$), sketched after this list:
  - Input: any combination of $(x_0, x_1, x_t, t)$.
  - Architecture: analogous backbone ending in mean ($\mu_\phi$) and log-variance ($\log \sigma_\phi^2$) heads for reparameterization.
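As a concrete, hypothetical instantiation of the posterior network, the sketch below uses an MLP over concatenated flat inputs with mean and log-variance heads and the standard reparameterization trick. The layer widths, activation, and the choice to condition on all of $(x_0, x_1, x_t, t)$ are illustrative assumptions; image inputs would instead use a convolutional or transformer backbone, as noted above.

```python
import torch
import torch.nn as nn

class PosteriorMLP(nn.Module):
    """q_phi(z | x0, x1, x_t, t): illustrative MLP with mean / log-variance heads."""
    def __init__(self, data_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        in_dim = 3 * data_dim + 1                       # concat(x0, x1, x_t, t)
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)

    def forward(self, x0, x1, xt, t):
        # x0, x1, xt: [B, data_dim]; t: [B, 1]
        h = self.backbone(torch.cat([x0, x1, xt, t], dim=-1))
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```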
The sampling process for generation is as follows:
- Sample $x_0 \sim p_0$ and $z \sim \mathcal{N}(0, I)$.
- Numerically integrate the ODE $dx/dt = v_\theta(x, t, z)$ from $t = 0$ to $t = 1$.
- Output $x_1$, which is a sample from the model's approximation to $p_1$.
4. Empirical Results and Benchmarks
VRFM yields superior empirical performance on synthetic and real datasets:
| Task | Metric | Classic FM | VRFM |
|---|---|---|---|
| 1D Gaussian→bimodal | Log-likelihood (PW) | Lower | Higher |
| 2D circle transport | Likelihood/qualitative | Lower | Higher |
| MNIST (28×28) | FID vs. NFE | Higher | Lower |
| CIFAR-10 (32×32) | FID (NFE=5) | ≈35.5 | ≈28.9 |
| CIFAR-10 (adaptive) | FID | 3.66 | 3.55 |
| ImageNet (256²) | FID-50K@400K | 17.2 | 14.6 |
| ImageNet (256²) | FID-50K@800K | 13.1 | 10.6 |
On MNIST, VRFM exhibits smooth 2D manifolds in latent space, allowing manipulation of digit style via the latent $z$ and content diversity via the source sample $x_0$. On CIFAR-10 and ImageNet, VRFM improves FID at both fixed and adaptive NFE, and supports controllable style-content synthesis by conditioning on $z$. With classifier-free guidance on ImageNet, V-SiT-XL improves FID further (e.g., 3.22 vs. 3.43 at 800K steps).
5. Algorithmic Implementation
Training Algorithm
```
for minibatch in dataset:
    x0 = sample(p0)                              # Source samples
    x1 = minibatch                               # Target samples
    t = Uniform(0, 1)
    xt = (1 - t) * x0 + t * x1
    v_gt = x1 - x0
    z ~ q_phi(z | x0, x1, xt, t)                 # Encoder outputs (mu_phi, sigma_phi)
    l_rec = ||v_theta(xt, t, z) - v_gt||^2
    l_kl = KL(q_phi || N(0, I))
    loss = l_rec + lambda * l_kl
    update(theta, phi, loss)
```
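A runnable PyTorch version of one optimization step, mirroring the pseudocode above, might look as follows. The `velocity_net(xt, t, z)` module, the `PosteriorMLP` sketched in Section 3, the standard-normal source $p_0$, and the value of `lam` are assumptions for illustration rather than the paper's exact implementation; `optimizer` is expected to cover both networks' parameters.

```python
import torch

def train_step(velocity_net, posterior_net, optimizer, x1, lam=1e-2):
    """One VRFM update on a batch of flat target samples x1 of shape [B, D]."""
    B = x1.shape[0]
    x0 = torch.randn_like(x1)                    # source samples, p0 = N(0, I)
    t = torch.rand(B, 1)                         # t ~ Uniform(0, 1)
    xt = (1 - t) * x0 + t * x1                   # linear interpolation
    v_gt = x1 - x0                               # ground-truth velocity

    z, mu, logvar = posterior_net(x0, x1, xt, t) # reparameterized latent sample
    v_pred = velocity_net(xt, t, z)

    l_rec = ((v_pred - v_gt) ** 2).sum(dim=-1).mean()
    l_kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
    loss = l_rec + lam * l_kl

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```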
Inference Algorithm
```
x0 = sample(p0)
z = sample(N(0, I))
x = x0
for t in [0, ..., 1]:
    dxdt = v_theta(x, t, z)
    x = x + dxdt * dt                            # Euler or Dormand–Prince solvers
x1 = x
```
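A fixed-step Euler version of this loop, again as an illustrative sketch with a hypothetical `velocity_net` and an arbitrary step count, could be written as:

```python
import torch

@torch.no_grad()
def sample(velocity_net, shape, latent_dim, num_steps=100):
    """Euler integration of dx/dt = v_theta(x, t, z) from t = 0 to t = 1."""
    x = torch.randn(shape)                       # x0 ~ p0 = N(0, I)
    z = torch.randn(shape[0], latent_dim)        # one latent per sample, fixed along the path
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((shape[0], 1), k * dt)
        x = x + velocity_net(x, t, z) * dt       # Euler step; an adaptive solver could replace this
    return x                                     # approximate sample from p1
```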
6. Hyperparameter Regimes and Ablations
Key ablation findings include:
- KL-weight ($\lambda$): the best-performing value is dataset-dependent, with different typical ranges for CIFAR-10 and MNIST.
- Posterior conditioning: best performance when the encoder conditions on pairs of the available inputs ($x_0$, $x_1$, $x_t$, $t$) rather than a single one.
- Fusion mechanisms (CIFAR-10): Adaptive Norm (adding $z$ to the time embedding) and Bottleneck Sum (injecting $z$ at the lowest U-Net resolution) are both effective; a minimal sketch of the former follows this list.
- Latent dimension: 1D/2D for small images; 768 for CIFAR-10; 1152 for ImageNet.
- Batch size: 256–512 for images; 1,000 for synthetic.
- Optimization: AdamW; learning rate and weight decay are set per benchmark.
- Training steps: 20K (synthetic), 100K (MNIST), 600K (CIFAR-10), 800K (ImageNet).
- Encoder size: an encoder shrunk to roughly 6.7% of its default size retains most of the performance.
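As a concrete illustration of the "Adaptive Norm" fusion named above, the sketch below projects the latent and adds it to the time embedding that conditions the backbone's normalization layers. The class name, layer sizes, and activation are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AdaNormFusion(nn.Module):
    """Project z and add it to the time embedding (illustrative 'Adaptive Norm' fusion)."""
    def __init__(self, latent_dim: int, time_emb_dim: int):
        super().__init__()
        self.z_proj = nn.Sequential(nn.Linear(latent_dim, time_emb_dim), nn.SiLU())

    def forward(self, t_emb: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # The fused embedding then drives the backbone's adaptive normalization layers.
        return t_emb + self.z_proj(z)
```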
7. Significance and Extensions
VRFM addresses the multi-modality of transport vector-fields in neural ODE-based generative modeling. By leveraging a variational ELBO on Gaussian mixture velocity predictions, VRFM provides:
- Higher sample quality (lower FID and higher log-likelihood) than classic flow matching, especially in low-NFE regimes, which streamlines sampling for practical deployment.
- A simple latent-based style-content control mechanism: fixing the latent $z$ controls style, while varying the source $x_0$ modulates content.
- Applicability to high-dimensional generative tasks, including complex image synthesis on benchmarks such as CIFAR-10 and ImageNet.
The use of VRFM as a building block in multi-stage and multimodal generative systems, such as audio synthesis pipelines for text-to-room impulse response generation (Vosoughi et al., 25 Oct 2025), demonstrates its utility as an ODE-based generative operator in latent space. Potential future directions include structured latent variables for more expressive control, and integration with non-linear or curved interpolation trajectories to better model the geometry of intricate data manifolds.
Summary Table: Conceptual Comparison
| Feature | Classic Rectified Flow Matching | Variational Rectified Flow Matching |
|---|---|---|
| Velocity Modeling | Uni-modal (mean-squared error) | Multi-modal (latent-indexed) |
| Loss | MSE | Variational ELBO |
| Sampling | ODE – one velocity per location | ODE with latent-sampled velocities |
| Style Control | Not inherent | Direct via latent |
| Empirical Results | FID/log-likelihood: baseline | FID/log-likelihood: improved |
| Guidance | Not explicit | Compatible with classifier-free |
The introduction of explicit latent variables and a variational training regimen in VRFM fundamentally enhances the ability of flow-matching models to replicate complex, multi-modal data distributions and supports new advances in efficient conditional generative modeling (Guo et al., 13 Feb 2025).