Hierarchical Rectified Flow Matching
- The paper introduces a framework that extends classic RF by hierarchically coupling multiple ODEs across location, velocity, and higher-order domains.
- By allowing integration paths to intersect, it yields straighter flows, significantly reducing the number of neural function evaluations required and improving sample efficiency.
- Mini-batch couplings using optimal transport further refine modality control, enhancing quality in modeling multi-modal data distributions.
Hierarchical Rectified Flow Matching is a generative modeling framework that extends classic Rectified Flow (RF) matching by hierarchically coupling multiple ordinary differential equations (ODEs) across location, velocity, and higher-order domains such as acceleration and jerk. The approach models the full distribution of random velocity and its higher-order derivatives at each space–time point, supporting intersecting integration paths and straighter flows through the sample space. This structural improvement reduces the number of neural function evaluations (NFEs) necessary for data generation and enables effective modeling of multi-modal data distributions. Hierarchical rectified flows (HRF) may further be enhanced by mini-batch couplings, allowing control over the modality of distributions at each hierarchy level and improving sample efficiency and quality.
1. Rectified Flow and the Motivation Behind Hierarchical Extensions
Classic Rectified Flow (RF) matching seeks to learn a time-dependent velocity field that transports samples from a tractable source distribution (e.g., a standard Gaussian) to a complex target distribution (e.g., images). The RF objective minimizes the expected mean-squared error between the ground-truth conditional velocity and the learned field:

$$\min_{v} \; \mathbb{E}_{x_0 \sim p_0,\, x_1 \sim p_1,\, t \sim \mathcal{U}(0,1)} \left[ \left\| (x_1 - x_0) - v(x_t, t) \right\|^2 \right], \qquad x_t = (1-t)\,x_0 + t\,x_1.$$
This formulation forces the model to regress to the mean velocity at each space-time point. Averaging is sufficient for matching the data distribution, but it discards the multi-modal structure of the random velocity field itself: for a symmetric two-mode target, for instance, the regressed velocity between the modes points toward their average rather than toward either mode. Because the learned field is single-valued, the integration paths traced during sampling cannot intersect, leading to unduly curved paths and a requirement for a high number of NFEs.
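For concreteness, here is a minimal PyTorch sketch of this RF objective; the velocity network `v_theta` is a placeholder, and batches `x0 ~ p0`, `x1 ~ p1` are assumed given:

```python
import torch

def rf_loss(v_theta, x0, x1):
    """Classic RF matching loss: regress the straight-line velocity x1 - x0
    at a uniformly sampled point x_t on the interpolation path."""
    bshape = (x0.shape[0],) + (1,) * (x0.dim() - 1)
    t = torch.rand(bshape)                # t ~ Uniform(0,1), broadcastable
    x_t = (1 - t) * x0 + t * x1           # linear interpolant
    return ((v_theta(x_t, t) - (x1 - x0)) ** 2).mean()
```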
The hierarchical rectified flow (HRF) framework generalizes RF by modeling not only the mean velocity but also the full random velocity distribution and its higher-order derivatives (acceleration, jerk) using a hierarchy of coupled ODEs. This richer approach yields straighter, intersecting integration paths, greatly reducing the numerical burden at generation time (Zhang et al., 24 Feb 2025).
2. Hierarchical ODE Structure and Stochastic Process Formulation
The HRF hierarchy is constructed by recursively introducing levels corresponding to location, velocity, acceleration, and so forth. Each level captures the distributional structure of the relevant "derivative" at each space–time point along the interpolation.
- Location-domain ODE (depth 1):
  $$\mathrm{d}z(t) = v\,\mathrm{d}t, \qquad z(0) \sim p_0.$$
  Here, $z(t)$ is the state evolving under the estimated velocity field; in HRF, $v$ is itself produced by the next level's ODE.
- Velocity-domain ODE (depth 2):
  Introduces a velocity sample $u_0 \sim T_0$ (a tractable source velocity distribution), a target velocity $u_1 = x_1 - x_0$, and a new "velocity time" $T \in [0,1]$ with interpolant
  $$u_T = (1-T)\,u_0 + T\,u_1,$$
  integrated as
  $$\mathrm{d}u(T) = a_\varphi(x_t, t, u(T), T)\,\mathrm{d}T,$$
  where $a_\varphi$ is an acceleration model driving $u(T)$ towards the target velocity $u_1$.
- Higher-order domains (e.g., acceleration; depth 3+):
  For acceleration, introduce a source acceleration sample $a_0$, the target acceleration $a_1 = (x_1 - x_0) - u_0$, and a "jerk time" $S$; interpolate $a_S = (1-S)\,a_0 + S\,a_1$, and learn a jerk model $j(x_t, t, u_T, T, a_S, S)$ that regresses the difference $a_1 - a_0$.
- The output of the innermost ODE supplies the driving term for the next outer level: integrating the jerk yields an acceleration, integrating the acceleration yields a velocity, and integrating the velocity yields the sample.
The nested structure yields a stack of coupled ODEs, each corresponding to a higher-order deviation from the mean, and forms a time-differentiable stochastic process with random velocity, acceleration, etc.
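To make the recursion concrete, the following sketch (a hypothetical helper, not from the source) builds the interpolants and ground-truth targets for arbitrary depth; for depths 1–3 it reproduces $x_t$, $u_T$, and $a_S$ along with their targets:

```python
import torch

def hrf_interpolants(x0, x1, sources, times):
    """Build HRF interpolants level by level. `sources` holds the source
    samples (u0, a0, ...) for depths 2, 3, ...; `times` holds (t, T, S, ...).
    At each depth, the target is the previous target minus that depth's
    source sample."""
    target = x1 - x0                                    # depth-1 target velocity
    interps = [(1 - times[0]) * x0 + times[0] * x1]     # x_t
    for src, tau in zip(sources, times[1:]):
        interps.append((1 - tau) * src + tau * target)  # u_T, a_S, ...
        target = target - src                           # next ground-truth term
    return interps, target                              # target = innermost GT
```

For depth 2, `sources = [u0]` yields the interpolants $(x_t, u_T)$ and the target $(x_1 - x_0) - u_0$, matching the HRF2 objective below.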
3. Training Objectives and Sample Generation Procedures
In HRF, the matching loss at each hierarchy level compares the model output to a "ground-truth" difference constructed from data and sampled initial states. For depth 2 (velocity-domain acceleration matching), the HRF2 objective is

$$\min_{\varphi} \; \mathbb{E}_{x_0 \sim p_0,\, x_1 \sim p_1,\, u_0 \sim T_0,\, t, T \sim \mathcal{U}(0,1)} \left[ \left\| \big( (x_1 - x_0) - u_0 \big) - a_\varphi(x_t, t, u_T, T) \right\|^2 \right],$$

where $x_t = (1-t)\,x_0 + t\,x_1$, $u_T = (1-T)\,u_0 + T\,(x_1 - x_0)$, and the ground-truth acceleration is $(x_1 - x_0) - u_0$. Deeper variants repeat this construction, regressing at each level the difference between the previous level's target and that level's source sample.
Training Algorithm (HRF2):
A PyTorch-style rendering of the training loop (`a_phi`, the optimizer `opt`, and `dataloader` are assumed to be defined elsewhere; the source velocity distribution `T0` is taken to be a standard Gaussian for illustration):

```python
import torch

# HRF2 training: regress the ground-truth acceleration (x1 - x0) - u0
# from the interpolants (x_t, t) and (u_T, T).
for x1 in dataloader:                          # {x1} ~ p1 (data)
    x0 = torch.randn_like(x1)                  # {x0} ~ p0 (standard Gaussian)
    u0 = torch.randn_like(x1)                  # {u0} ~ T0 (source velocities)
    bshape = (x1.shape[0],) + (1,) * (x1.dim() - 1)
    t = torch.rand(bshape)                     # t ~ Uniform(0,1)
    T = torch.rand(bshape)                     # T ~ Uniform(0,1)
    x_t = (1 - t) * x0 + t * x1                # location interpolant
    u_T = (1 - T) * u0 + T * (x1 - x0)         # velocity interpolant
    gt_acc = (x1 - x0) - u0                    # ground-truth acceleration
    loss = ((gt_acc - a_phi(x_t, t, u_T, T)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```
Sampling Algorithm (HRF2):

A corresponding nested Euler sampler integrates the inner velocity ODE (over $T$) inside each outer location step (over $t$), with step counts `J` (for $t$) and `L` (for $T$). A fresh velocity seed is drawn at each outer step, since the velocity is modeled as random at every space–time point; source distributions are again taken Gaussian for illustration, and `batch` and `data_shape` are placeholders:

```python
# HRF2 sampling: J outer (location) Euler steps, each containing
# L inner (velocity) Euler steps of the learned acceleration a_phi.
z = torch.randn(batch, *data_shape)            # z0 ~ p0
bshape = (batch,) + (1,) * len(data_shape)
for j in range(J):
    t = torch.full(bshape, j / J)
    u = torch.randn_like(z)                    # fresh u0 ~ T0 per outer step
    for l in range(L):
        T = torch.full(bshape, l / L)
        u = u + a_phi(z, t, u, T) / L          # inner Euler step (velocity time)
    z = z + u / J                              # outer Euler step (location time)
# z approximates a sample from p1
```
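Because every outer step invokes the full inner loop, one HRF2 sample costs J·L evaluations of `a_phi`; the empirical comparisons in Section 6 accordingly report sample quality against the total NFE budget.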
4. Advances Through Mini-Batch Couplings
Mini-Batch Couplings introduce optimal transport (OT)-based structuring at each hierarchy level.
- Data coupling (HRF2-D): samples mini-batches $\{x_0^{(i)}\}$ and $\{x_1^{(i)}\}$, then solves an OT assignment (permutation) problem to form joint training pairs $(x_0, x_1)$; a minimal pairing sketch follows this list. This collapses the multi-modal velocity distribution to near-unimodality, simplifying the learning landscape and straightening flows (Zhang et al., 17 Jul 2025).
- Velocity coupling (HRF2-V): Given a pretrained model, simulates sets of velocities, then couples via OT to further straighten acceleration paths.
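A minimal sketch of the OT pairing step, assuming squared-Euclidean cost and SciPy's Hungarian solver; the batch arrays and their pairing are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_couple(x0, x1):
    """Pair source and target mini-batches (both of shape (B, d)) via the
    optimal permutation under squared Euclidean cost, as in mini-batch OT."""
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(axis=-1)  # (B, B)
    row, col = linear_sum_assignment(cost)
    return x0[row], x1[col]
```

The same routine can in principle pair simulated velocity sets for the velocity-coupling variant.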
Two-Stage Hierarchical Coupling (HRF2-D&V):
- Train with data coupling to simplify the velocity domain.
- Fine-tune with velocity coupling for improved path straightness and reduced integration step requirements.
These strategies enable fine-grained control of distributional complexity at each ODE level; empirical evidence suggests substantial sample-quality improvements and step-count reductions.
5. Neural Network Parameterizations and Conditioning
The velocity, acceleration, and higher-order models required for HRF are parameterized as follows:
| Domain | Model | Encoding |
|---|---|---|
| Velocity | MLP, U-Net | $(x_t, t)$: state + time |
| Acceleration | MLP, U-Net | $(x_t, t, u_T, T)$: state + time on both axes |
| Higher-order (jerk) | MLP/U-Net | $(x_t, t, u_T, T, a_S, S)$: all relevant arguments |
- Time Conditioning: each continuous time variable (e.g., $t$, $T$) is embedded via sinusoidal positional encodings (8–32 dimensions) followed by a small MLP; a sketch follows this list.
- State Conditioning: states ($x_t$, $u_T$) pass through linear or convolutional layers; their embeddings are fused with the time embeddings by addition or concatenation.
- Depth-2 (1D/2D) Examples: separate embeddings for the location pair $(x_t, t)$ and the velocity pair $(u_T, T)$ are concatenated and passed through several dense layers (128–256 units).
- Image Data: dual U-Net streams, one per axis, with feature maps merged blockwise (ResNet-style).
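A minimal sketch of such a sinusoidal time embedding; the dimension and frequency schedule are illustrative choices within the 8–32-dimension range mentioned above:

```python
import math
import torch

def sinusoidal_embed(t, dim=16):
    """Embed times t (shape (B,), values in [0,1]) as `dim`-dimensional
    sin/cos features, to be passed through a small MLP downstream."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / (half - 1))
    ang = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)  # (B, dim)
```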
These parameterizations accommodate the multi-modal, high-dimensional distributions required for hierarchical modeling.
6. Empirical Performance, Sample Quality, and Efficiency
Empirical studies on synthetic 1D mixtures (of Gaussians), 2D multi-modal targets (moons, mixture Gaussians), and image datasets (MNIST, CIFAR-10, ImageNet-32, and CelebA-HQ 256) reveal the following:
- Metrics: Wasserstein distance (WD), sliced WD (SWD), and Fréchet Inception Distance (FID), reported against total NFEs (a minimal SWD sketch follows this list).
- Outcomes:
- HRF2 and HRF3 outperform RF and OTCFM at the same NFE budget; HRF trajectories are notably straighter and intersect.
- HRF2-D achieves FID reductions of 2× or more at both low and high NFE budgets; e.g., MNIST FID ≈ 10 → 5, CIFAR-10 FID ≈ 23 → 10.
- HRF2-D&V maintains or improves sample quality at extremely low NFE (NFE = 1): MNIST FID ≈ 7 → 3, CIFAR-10 FID ≈ 15 → 6.
- Qualitative samples show fewer spurious modes and improved sharpness (see the figures in the source papers).
- RF baselines typically require 40–80% more NFEs for comparable sample quality in both synthetic and real data scenarios.
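As a reference for the metrics above, a minimal Monte-Carlo sliced-Wasserstein sketch; equal sample counts are assumed, and the projection count is an illustrative choice rather than the papers' evaluation code:

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=128, seed=0):
    """Monte-Carlo sliced Wasserstein-1 distance between sample sets x, y
    of shape (N, d): average 1-D W1 over random projection directions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=x.shape[1])
        theta /= np.linalg.norm(theta)             # unit projection direction
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.abs(px - py).mean()            # 1-D W1 via sorted quantiles
    return total / n_proj
```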
7. Limitations, Implementation Considerations, and Future Research
Key limitations include:
- Increased model complexity: Each hierarchy level requires a separate network (more parameters).
- Sampling involves nested ODE solving; balancing step counts ($J$, $L$, etc.) for efficiency is nontrivial.
- Training cost: Data coupling incurs OT computations per batch; velocity coupling necessitates data simulation.
- Simple Euler solvers currently dominate; exploring higher-order or adaptive integrators (e.g., RK45, symplectic) may yield further gains.
- Velocity coupling relies on simulation to generate target velocities; simulation-free coupling is a prospective direction.
Future directions:
- Optimizing step count allocation across levels to minimize NFE.
- Investigating adaptive and symplectic integrators for the nested structure.
- Integration with variance-reduction schemes and semantic latent-space modeling (e.g., variational extensions).
- Combining HRF with mini-batch optimal transport flow matching for improved sample diversity and quality.
A plausible implication is that HRF and its mini-batch coupling variants stand to further advance the efficiency and modeling fidelity of modern generative ODE-based approaches. The hierarchical architecture supports arbitrary depth and incremental control over distributional complexity, providing a flexible foundation for continued development within the generative modeling domain.