
Variational Rectified Flow Matching

Updated 10 November 2025
  • The paper introduces VRFM, which explicitly models multi-modal velocity fields using latent variables to improve generative sample quality over classic rectified flow matching.
  • VRFM employs a variational ELBO combining reconstruction loss and KL regularization to optimize neural ODE-based transport between distributions.
  • Empirical results on datasets like MNIST, CIFAR-10, and ImageNet demonstrate VRFM’s superior FID, log-likelihood, and sampling efficiency.

Variational Rectified Flow Matching (VRFM) is a generative modeling framework that extends rectified flow matching by explicitly accounting for the inherent multi-modality of velocity vector-fields encountered when transporting samples between probability distributions. Unlike classic rectified flow matching, which collapses diverse optimal transport directions into uni-modal field estimates through mean-squared error regression, VRFM introduces a latent variable to parameterize and sample from a mixture of plausible flows at each point along a distribution-matching trajectory. This approach more faithfully captures the structure of complex, multi-modal data distributions and provides improvements in both sample quality and sampling efficiency across a broad range of synthetic and real-world domains.

1. Theoretical Foundations

Rectified flow matching provides a continuous interpolation mechanism between a source distribution $p_0(x_0)$ and a target data distribution $p_1(x_1)$. For any coupled pair $(x_0, x_1)$, points along the linear interpolation

$$x_t = (1-t)\,x_0 + t\,x_1, \qquad t \in [0, 1]$$

are moved through the vector field

$$v(x_t, t) = x_1 - x_0.$$

A neural velocity field $v_\theta(x, t)$ parameterizes this transport, defining the generative process through the ODE

$$\frac{dx}{dt} = v_\theta(x, t), \qquad x(0) = x_0,$$

with the log-density along a trajectory $x(t)$ evolving according to the instantaneous change-of-variables formula

$$\frac{d \log p_t(x(t))}{dt} = -\mathrm{div}\,v_\theta(x(t), t).$$

In classic flow matching, when two different starting/ending pairs $(x_0, x_1)$ and $(x_0', x_1')$ are linearly interpolated to the same $(x_t, t)$, the set of valid velocities at that location is multi-modal. However, the mean-squared error loss enforces a regression to the mean, erasing this structure and biasing the learned field.
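To make the collision concrete, here is a minimal Python sketch. It uses two hand-picked 1D pairs (illustrative values, not taken from the paper) whose straight-line paths cross at the same $(x_t, t)$, and the helper names `interpolate` and `mse` are introduced only for this example.

```python
# Two coupled pairs (x0, x1) whose linear paths meet at the same point at t = 0.5.
# The values are illustrative assumptions, chosen so the collision is exact.
pairs = [(-1.0, 1.0), (1.0, -1.0)]
t = 0.5


def interpolate(x0, x1, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1


# Both pairs land on the same location x_t ...
points = [interpolate(x0, x1, t) for x0, x1 in pairs]

# ... but carry opposite ground-truth velocities x1 - x0.
velocities = [x1 - x0 for x0, x1 in pairs]

# The single prediction that minimizes mean-squared error is the arithmetic mean.
mse_optimal = sum(velocities) / len(velocities)


def mse(pred):
    """Mean-squared error of one shared prediction against both velocities."""
    return sum((pred - v) ** 2 for v in velocities) / len(velocities)


print(points)        # both entries coincide
print(velocities)    # two distinct modes
print(mse_optimal)   # the mean, which matches neither mode
```

In this toy case the MSE minimizer sits between the two valid velocities and equals neither of them, which is the averaging effect described above.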

VRFM augments this structure by associating a latent variable $z \sim \mathcal{N}(0, I)$ with each instance. Given $z$, the velocity is modeled conditionally as $p(v \mid x_t, t, z) = \mathcal{N}(v; v_\theta(x_t, t, z), I)$, producing a Gaussian mixture marginal over $v$.

2. Variational Formulation and Training Objective

The generative process aims to maximize the marginal likelihood of the "ground-truth" velocity $v^{\text{gt}} = x_1 - x_0$ for a given $(x_t, t)$:

$$\log p(v^{\text{gt}} \mid x_t, t) = \log \int p(v^{\text{gt}} \mid x_t, t, z)\, p(z)\, dz.$$

Introducing an encoder $q_\phi(z \mid x_0, x_1, x_t, t)$, training proceeds by maximizing the evidence lower bound (ELBO):

$$\log p(v^{\text{gt}} \mid x_t, t) \geq \mathbb{E}_{z \sim q_\phi} \left[ \log p(v^{\text{gt}} \mid x_t, t, z) \right] - D_{\mathrm{KL}}\big(q_\phi(z \mid x_0, x_1, x_t, t) \,\|\, p(z)\big).$$

Given the Gaussian form, maximizing the ELBO is equivalent (up to an additive constant and a rescaling absorbed into $\lambda$) to minimizing a reconstruction (MSE) term plus a KL regularizer:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{t, x_0, x_1} \left[ \mathbb{E}_{z \sim q_\phi} \| v_\theta(x_t, t, z) - (x_1 - x_0) \|^2 + \lambda\, D_{\mathrm{KL}}\big(q_\phi(z \mid x_0, x_1, x_t, t) \,\|\, p(z)\big) \right],$$

where $\lambda$ modulates the regularization trade-off.
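The reduction from the ELBO to the MSE-plus-KL loss can be written out explicitly. The sketch below assumes the unit-covariance Gaussian likelihood stated earlier, a velocity of dimension $d$, and a diagonal Gaussian posterior with latent dimension $k$. The diagonal form is an assumption made here for illustration; it is consistent with mean and log-variance heads but is not spelled out in this section.

```latex
% Assumptions: p(v | x_t, t, z) = N(v; v_\theta(x_t, t, z), I) with v in R^d,
% and q_\phi = N(\mu_\phi, diag(\sigma_\phi^2)) with z in R^k (diagonal form assumed).
\begin{align*}
\log p(v^{\text{gt}} \mid x_t, t, z)
  &= -\tfrac{1}{2}\,\| v^{\text{gt}} - v_\theta(x_t, t, z) \|^2 - \tfrac{d}{2}\log(2\pi), \\
D_{\mathrm{KL}}\big(q_\phi \,\|\, \mathcal{N}(0, I)\big)
  &= \tfrac{1}{2}\sum_{j=1}^{k}\big(\mu_{\phi,j}^2 + \sigma_{\phi,j}^2 - 1 - \log \sigma_{\phi,j}^2\big), \\
-2\,\mathrm{ELBO}
  &= \mathbb{E}_{z \sim q_\phi}\,\| v_\theta(x_t, t, z) - (x_1 - x_0) \|^2
     + 2\,D_{\mathrm{KL}}\big(q_\phi \,\|\, p(z)\big) + d\log(2\pi).
\end{align*}
```

Under these assumptions, $\lambda = 2$ would correspond to the exact bound, so other values of $\lambda$ act as a reweighting of the KL term relative to the reconstruction term.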

3. Model Architecture and Sampling Procedure

Both the velocity network $v_\theta$ and the posterior $q_\phi$ are implemented using neural architectures suited to the data domain:

  • Velocity network ($v_\theta$):
    • Input: $(x_t, t, z)$. Time $t$ is encoded (e.g., sinusoidal + projection), and $z$ is processed via a small MLP.
    • Backbone: MLP for 1D/2D, convolutional-ResNet for MNIST, UNet/Transformer for higher-dimensional domains (e.g., CIFAR-10, ImageNet).
    • Output: mean velocity vector in the data domain.
  • Posterior network ($q_\phi$):
    • Input: any combination of $\{x_0, x_1, x_t, t\}$.
    • Architecture: analogous backbone ending in mean ($\mu_\phi$) and log-variance ($\log \sigma_\phi^2$) heads for reparameterization.
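As one possible reading of the "sinusoidal" time encoding mentioned in the list above, the sketch below follows the common transformer-style sinusoidal embedding. The function name, the `max_period` default, and the cos-then-sin layout are assumptions for illustration rather than details confirmed by the source, and the learned projection that would follow is omitted.

```python
import math


def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map a scalar time t to a dim-dimensional vector of cosines and sines.

    Hypothetical helper: frequencies are spaced geometrically between 1 and
    1 / max_period, as in standard sinusoidal position embeddings.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


embedding = sinusoidal_time_embedding(0.25, dim=8)
print(len(embedding))
```

Because $t$ lies in $[0, 1]$ here, implementations often rescale it before embedding; whether and how that is done in VRFM is not stated in this section.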

The sampling process for generation is as follows:

  1. Sample $x_0 \sim p_0$ and $z \sim \mathcal{N}(0, I)$.
  2. Numerically integrate $\frac{dx}{dt} = v_\theta(x, t, z)$ from $t = 0$ to $t = 1$.
  3. Output $x_1 \approx x(t = 1)$, which is a sample from the model's approximation to $p_1$.

4. Empirical Results and Benchmarks

VRFM yields superior empirical performance on synthetic and real datasets:

| Task | Metric | Classic FM | VRFM |
|---|---|---|---|
| 1D Gaussian→bimodal | Log-likelihood (PW) | Lower | Higher |
| 2D circle transport | Likelihood/qualitative | Lower | Higher |
| MNIST (28×28) | FID vs. NFE | Higher | Lower |
| CIFAR-10 (32×32) | FID (NFE=5) | ≈35.5 | ≈28.9 |
| CIFAR-10 (adaptive) | FID | 3.66 | 3.55 |
| ImageNet (256²) | FID-50K@400K | 17.2 | 14.6 |
| ImageNet (256²) | FID-50K@800K | 13.1 | 10.6 |

On MNIST, VRFM exhibits smooth 2D manifolds in latent space, allowing manipulation of digit style via $z$ and content diversity via $x_0$. On CIFAR-10 and ImageNet, VRFM improves FID at both fixed and adaptive numbers of function evaluations (NFE), and supports controllable style-content synthesis by conditioning on $z$. With classifier-free guidance on ImageNet, V-SiT-XL improves FID further (e.g., 3.22 vs. 3.43 at 800K steps).

5. Algorithmic Implementation

Training Algorithm

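for minibatch in dataset:
    x1 = minibatch                                   # target (data) samples
    x0 = sample(p0)                                  # source samples, one per target
    t = Uniform(0, 1)                                # one time value per pair
    xt = (1 - t) * x0 + t * x1                       # linear interpolation
    v_gt = x1 - x0                                   # ground-truth velocity

    mu, log_var = q_phi(x0, x1, xt, t)               # encoder outputs posterior parameters
    z = mu + exp(0.5 * log_var) * sample(N(0, I))    # reparameterized latent sample
    l_rec = ||v_theta(xt, t, z) - v_gt||^2
    l_kl = KL(N(mu, exp(log_var)) || N(0, I))
    loss = l_rec + kl_weight * l_kl                  # kl_weight is lambda
    update(theta, phi, loss)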
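The loss evaluated inside the training loop can be sketched in plain Python for a single $(x_0, x_1, t)$ triple. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encoder` and `toy_velocity` are hypothetical stand-ins for trained networks, the posterior is assumed to be a diagonal Gaussian, and no gradient update is performed.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def vrfm_example_loss(x0, x1, t, encoder, velocity, kl_weight, rng):
    """One-sample loss estimate for a single pair; vectors are lists of floats."""
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # linear interpolation
    v_gt = [b - a for a, b in zip(x0, x1)]                 # ground-truth velocity
    mu, log_var = encoder(x0, x1, xt, t)                   # posterior parameters
    # Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    v_pred = velocity(xt, t, z)
    l_rec = sum((p - g) ** 2 for p, g in zip(v_pred, v_gt))
    l_kl = kl_to_standard_normal(mu, log_var)
    return l_rec + kl_weight * l_kl, l_rec, l_kl


def toy_encoder(x0, x1, xt, t):
    """Hypothetical encoder that returns the prior itself (mu = 0, log_var = 0)."""
    return [0.0], [0.0]


def toy_velocity(xt, t, z):
    """Hypothetical velocity network that always predicts zero."""
    return [0.0 for _ in xt]


rng = random.Random(0)
loss, l_rec, l_kl = vrfm_example_loss(
    [-1.0], [1.0], 0.5, toy_encoder, toy_velocity, kl_weight=2e-3, rng=rng
)
print(loss, l_rec, l_kl)
```

With these toy stand-ins the KL term vanishes because the posterior equals the prior, so the loss is entirely reconstruction error. A trained encoder would trade a nonzero KL for a lower reconstruction term.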

Inference Algorithm

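x0 = sample(p0)
z = sample(N(0, I))                   # drawn once, held fixed along the trajectory
x = x0
dt = 1 / num_steps
for k in range(num_steps):            # t = 0, dt, ..., 1 - dt
    t = k * dt
    x = x + v_theta(x, t, z) * dt     # Euler step; Dormand–Prince is an adaptive alternative
x1 = x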
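A runnable version of the Euler loop is sketched below. Since no trained network is available here, `toy_velocity` is a hypothetical hand-written 1D field in which the sign of $z$ selects one of two target modes at $\pm 1$; it only illustrates how a fixed $z$ steers the trajectory and is not a learned $v_\theta$.

```python
import random


def toy_velocity(x, t, z):
    """Hypothetical stand-in for v_theta(x, t, z) in 1D.

    The sign of z picks the target mode (+1 or -1); the field then points
    along the straight line from x to that target over the remaining time.
    """
    target = 1.0 if z >= 0.0 else -1.0
    return (target - x) / (1.0 - t)


def euler_sample(velocity, x0, z, num_steps=10):
    """Integrate dx/dt = velocity(x, t, z) from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / num_steps
    x = x0
    for k in range(num_steps):
        t = k * dt                      # t never reaches 1, so 1 - t stays positive
        x = x + velocity(x, t, z) * dt
    return x


rng = random.Random(0)
x0 = rng.gauss(0.0, 1.0)                # source sample from N(0, 1)

# Same x0, different latent codes: the latent decides which mode is reached.
sample_pos = euler_sample(toy_velocity, x0, z=0.7)
sample_neg = euler_sample(toy_velocity, x0, z=-0.7)
print(sample_pos, sample_neg)
```

Holding $x_0$ fixed and changing only $z$ sends the same starting point to different modes in this toy field, a small-scale analogue of using $z$ as a control variable during sampling.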

6. Hyperparameter Regimes and Ablations

Key ablation findings include:

  • KL-weight ($\lambda$): Typical values are $2 \times 10^{-3}$ to $5 \times 10^{-3}$ for CIFAR, $1 \times 10^{-3}$ for MNIST.
  • Posterior conditioning: Best performance when conditioning on both $x_0$ and $x_1$, or on $x_1$ together with $t$.
  • Fusion mechanisms (CIFAR-10): Adaptive Norm (adding $z$ to the time embedding) and Bottleneck Sum (injecting $z$ at the lowest U-Net resolution) are both effective.
  • Latent dimension: 1D/2D for small images; 768 for CIFAR-10; 1152 for ImageNet.
  • Batch size: 256–512 for images; 1,000 for synthetic.
  • Optimization: AdamW, learning rate $1\mathrm{e}{-3}$ to $2\mathrm{e}{-4}$, weight decay $1\mathrm{e}{-2}$.
  • Training steps: 20K (synthetic), 100K (MNIST), 600K (CIFAR-10), 800K (ImageNet).
  • Encoder size: Up to 6.7% size retains most of the performance.

7. Significance and Extensions

VRFM addresses the multi-modality of transport vector-fields in neural ODE-based generative modeling. By leveraging a variational ELBO on Gaussian mixture velocity predictions, VRFM provides:

  • Higher sample quality (lower FID and higher log-likelihood) than classic flow matching, especially at low NFE regimes, streamlining sampling for practical deployment.
  • A simple latent-based style-content control mechanism: fixing $z$ controls style, while $x_0$ modulates content.
  • Applicability to high-dimensional generative tasks, including complex image synthesis on benchmarks such as CIFAR-10 and ImageNet.

The use of VRFM as a building block in multi-stage and multimodal generative systems, such as audio synthesis pipelines for text-to-room impulse response generation (Vosoughi et al., 25 Oct 2025), demonstrates its utility as an ODE-based generative operator in latent space. Potential future directions include structured latent variables for more expressive control, and integration with non-linear or curved interpolation trajectories to better model the geometry of intricate data manifolds.

Summary Table: Conceptual Comparison

| Feature | Classic Rectified Flow Matching | Variational Rectified Flow Matching |
|---|---|---|
| Velocity Modeling | Uni-modal (mean-squared error) | Multi-modal (latent-indexed) |
| Loss | MSE | Variational ELBO |
| Sampling | ODE with one velocity per location | ODE with latent-sampled velocities |
| Style Control | Not inherent | Direct via latent $z$ |
| Empirical Results | FID/log-likelihood: baseline | FID/log-likelihood: improved |
| Guidance | Not explicit | Compatible with classifier-free |

The introduction of explicit latent variables and a variational training regimen in VRFM fundamentally enhances the ability of flow-matching models to replicate complex, multi-modal data distributions and supports new advances in efficient conditional generative modeling (Guo et al., 13 Feb 2025).
