Latent Conditional Flow Matching (CFM)
- Latent Conditional Flow Matching (CFM) is a unified generative modeling framework that leverages conditional probability paths and optimal transport for mapping noise to complex data distributions.
- It directly regresses neural vector fields onto prescribed probability flows, combining the strengths of continuous normalizing flows and diffusion models.
- Latent CFM scales to high-resolution and conditional tasks by operating in latent space, reducing computational cost while offering theoretical guarantees on generation quality.
Latent Conditional Flow Matching (CFM) is a powerful generative modeling framework that extends and unifies continuous normalizing flows (CNFs) and score-based diffusion models, enabling efficient, simulation-free training of deep generative models with broad applicability. By leveraging conditional probability paths—particularly those based on optimal transport (OT)—CFM achieves robust, stable, and computationally efficient mappings between source distributions (typically noise) and complex target data distributions. Recent variants further combine CFM with latent variable models, leading to significant advancements in scalability, sample quality, training efficiency, and downstream conditional generation capabilities.
1. Core Principles of Flow Matching and Conditional Flow Matching
Flow Matching (FM) is designed to address the inefficiencies of classical CNF likelihood training and the rigidity of diffusion score matching. Rather than simulating stochastic processes or maximizing likelihoods via expensive ODE integrations, FM trains generative models by directly regressing a neural vector field onto the exact vector fields defining prescribed probability flows between noise and data.
Given a probability path $p_t(x)$ generated by a vector field $u_t(x)$, FM optimizes
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x \sim p_t(x)} \left\| v_\theta(t, x) - u_t(x) \right\|^2.$$
Here, $v_\theta$ is the neural approximation to the true vector field $u_t$, and $t \in [0,1]$ is the interpolant (time) parameter.
Conditional Flow Matching (CFM) generalizes FM to sample-specific, per-path conditional distributions:
$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\; x_1 \sim q(x_1),\; x \sim p_t(x \mid x_1)} \left\| v_\theta(t, x) - u_t(x \mid x_1) \right\|^2,$$
where $u_t(x \mid x_1)$ is the vector field that generates the conditional path $p_t(x \mid x_1)$ from noise to a data sample $x_1$.
The CFM objective enables simulation-free, fully supervised learning of vector fields for a wide variety of interpolating probability paths.
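To make the objective concrete, the following is a minimal PyTorch sketch of one simulation-free CFM training step; `velocity_net`, `mu_t`, `sigma_t`, and `u_t` are hypothetical names for a user-supplied network and conditional-path functions, not part of any particular library.

```python
import torch

def cfm_loss(velocity_net, x1, mu_t, sigma_t, u_t):
    """One simulation-free CFM training step (sketch, under the assumptions above).

    velocity_net : hypothetical module computing v_theta(t, x)
    x1           : batch of data samples, shape (B, ...)
    mu_t/sigma_t : callables giving the conditional path's mean and std at time t
    u_t          : callable giving the target conditional vector field u_t(x | x1)
    """
    t = torch.rand(x1.shape[0], device=x1.device)          # t ~ U[0, 1]
    tb = t.view(-1, *([1] * (x1.dim() - 1)))                # time reshaped for broadcasting
    x0 = torch.randn_like(x1)                               # base noise sample
    xt = mu_t(tb, x1) + sigma_t(tb, x1) * x0                # sample x ~ p_t(x | x1)
    target = u_t(tb, xt, x1, x0)                            # exact conditional vector field
    pred = velocity_net(t, xt)                              # neural approximation
    return ((pred - target) ** 2).mean()                    # MSE regression onto the target field
```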
2. Conditional Probability Paths and Optimal Transport Flows
The choice of conditional probability paths $p_t(x \mid x_1)$, and consequently the form of $u_t(x \mid x_1)$, is central to the expressiveness and efficiency of CFM:
- Gaussian Conditional Paths: A family $p_t(x \mid x_1) = \mathcal{N}\big(x \mid \mu_t(x_1), \sigma_t(x_1)^2 I\big)$, with schedules for the mean $\mu_t(x_1)$ and standard deviation $\sigma_t(x_1)$ interpolating between base noise and data.
- Diffusion Paths: Recover standard SDE-based interpolants; vector fields become those of score-based diffusion.
- Optimal Transport (OT) Interpolants: Of particular practical significance, OT defines straight-line flows in mean and standard deviation from noise to data, $\mu_t(x_1) = t\, x_1$ and $\sigma_t(x_1) = 1 - (1 - \sigma_{\min})\, t$, with conditional vector field $u_t(x \mid x_1) = \frac{x_1 - (1 - \sigma_{\min})\, x}{1 - (1 - \sigma_{\min})\, t}$. This yields direct, linear flows that are computationally easier for neural networks to fit and for ODE solvers to integrate, resulting in improved efficiency and generalization (closed-form helpers are sketched after this list).
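As an illustration, the OT conditional path above reduces to simple closed forms. The helpers below (hypothetical names, written to plug into the training sketch in Section 1) implement its mean, standard deviation, and target vector field with a small illustrative `SIGMA_MIN`:

```python
SIGMA_MIN = 1e-4  # illustrative value for the terminal noise floor sigma_min

def ot_mu(t, x1):
    # mu_t(x1) = t * x1 : mean moves on a straight line from 0 to the data point
    return t * x1

def ot_sigma(t, x1):
    # sigma_t = 1 - (1 - sigma_min) * t : linearly shrinking noise scale
    return 1.0 - (1.0 - SIGMA_MIN) * t

def ot_u(t, xt, x1, x0):
    # u_t(x | x1) = (x1 - (1 - sigma_min) * x) / (1 - (1 - sigma_min) * t)
    return (x1 - (1.0 - SIGMA_MIN) * xt) / (1.0 - (1.0 - SIGMA_MIN) * t)
```

Since $x_t = t\,x_1 + (1 - (1 - \sigma_{\min})t)\,x_0$, the same target simplifies to $x_1 - (1 - \sigma_{\min})\,x_0$, which is why the OT interpolant is described as a straight-line flow.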
Empirical studies demonstrate that OT-based CFM achieves lower negative log-likelihood (NLL) and Fréchet Inception Distance (FID) than diffusion score-based approaches, especially on large-scale datasets such as ImageNet.
3. Latent Space Flow Matching: Motivation and Practice
Many high-dimensional data domains feature latent, low-dimensional structure or significant multi-modality. Latent Conditional Flow Matching (Latent-CFM) incorporates latent variable models (e.g., VAEs) to capture this structure, carrying out the flow matching within the latent space learned by a pre-trained or jointly-trained encoder.
- Each data point $x$ is encoded into a latent code $z_1 = E(x)$ by a pre-trained or jointly trained encoder $E$.
- The flow model learns a vector field $v_\theta(t, z)$ between latent noise $z_0 \sim \mathcal{N}(0, I)$ and $z_1$.
- Objective (for constant-velocity OT flows in latent space): $\mathcal{L}(\theta) = \mathbb{E}_{t,\, x,\, z_0} \big\| v_\theta(t, z_t) - (z_1 - z_0) \big\|^2$, with $z_t = (1 - t)\, z_0 + t\, z_1$.
- Sampling involves integrating the learned velocity field in latent space and decoding with the decoder $D$, producing data samples consistent with the underlying manifold.
This approach reduces the dimensionality and complexity of the generative modeling task, increasing computational efficiency and making high-resolution and conditional generative tasks feasible that are otherwise intractable.
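The sketch below puts these pieces together, assuming a pre-trained autoencoder exposed as hypothetical `encoder`/`decoder` callables and a velocity network `v_net` on latent codes; it is a minimal illustration under these assumptions, not the reference LFM implementation.

```python
import torch

def latent_cfm_step(v_net, encoder, x, optimizer):
    """One training step of constant-velocity (straight-line) flow matching in latent space."""
    with torch.no_grad():
        z1 = encoder(x)                              # data -> latent code z1 = E(x)
    z0 = torch.randn_like(z1)                        # base noise in latent space
    t = torch.rand(z1.shape[0], device=z1.device)
    tb = t.view(-1, *([1] * (z1.dim() - 1)))
    zt = (1 - tb) * z0 + tb * z1                     # linear interpolant z_t
    loss = ((v_net(t, zt) - (z1 - z0)) ** 2).mean()  # target velocity is z1 - z0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample_latent_cfm(v_net, decoder, shape, steps=50, device="cpu"):
    """Euler integration of dz/dt = v_theta(t, z) from noise to a latent code, then decode."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        z = z + dt * v_net(t, z)                     # explicit Euler step
    return decoder(z)                                # map the final latent back to data space
```

The number of Euler steps directly controls the NFE/quality trade-off discussed in Section 5.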
4. Conditional Generation and Applications
Conditional CFM readily supports various conditioning signals by extending the velocity field to consume auxiliary inputs:
- Label-conditioned generation: The velocity field is conditioned on class labels (or one-hot class vectors), enabling high-fidelity class-conditional sample synthesis (a minimal sketch follows this list).
- Inpainting and semantic-to-image tasks: Conditioning on masked images, semantic maps, or other structured conditions is possible by concatenating their encoded forms to the latent inputs of the velocity field.
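As an illustrative pattern (not the LFM architecture, which uses transformer-based backbones), the toy module below shows one way a hypothetical label-conditioned velocity field can consume a class embedding alongside the latent and the time input:

```python
import torch
import torch.nn as nn

class ConditionalVelocityField(nn.Module):
    """Toy MLP velocity field v_theta(t, z, y) for class-conditional latent CFM (illustrative only)."""

    def __init__(self, latent_dim, num_classes, hidden=512):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, hidden)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1 + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, t, z, y):
        # Concatenate the flat latent code, the scalar time, and the label embedding.
        h = torch.cat([z, t[:, None], self.class_emb(y)], dim=-1)
        return self.net(h)
```

Masks, semantic maps, or other structured conditions can be handled analogously by encoding them and concatenating (or cross-attending) them to the velocity field's inputs; classifier-free guidance corresponds to randomly dropping the condition during training and mixing conditional and unconditional predictions at sampling time.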
Representative results:
- On ImageNet-1k (256×256), latent CFM models achieve FID competitive with or better than leading diffusion and latent diffusion models (e.g., FID 4.46 with classifier-free guidance).
- Inpainting on CelebA-HQ (256×256) attains an FID of 4.09, outperforming non-diffusion baselines.
The framework generalizes easily to multimodal settings (e.g., text-and-image, point-clouds for robotics, or trajectory prediction), supporting a broad set of research domains.
5. Efficiency, Scalability, and Theoretical Guarantees
Key efficiency and theoretical properties include:
- Computational efficiency: Latent-CFM models reduce the number of function evaluations (NFE) required by the ODE solver and operate with dramatically lower memory, making both training and inference tractable on modest hardware.
- Statistical generalization: By operating in compressed latent space, models are less susceptible to overfitting high-frequency noise and learn representations aligned with data manifold structure.
- Wasserstein-2 guarantees: Minimizing the latent flow matching loss provides explicit upper bounds on the Wasserstein-2 distance between generated and ground-truth data (after decoding), contingent on the quality of the underlying VAE (a schematic decomposition follows this list).
- Trade-off: While latent-space models may be bottlenecked by the quality and expressivity of the autoencoder backbone, empirical results suggest this is offset by gains in scalability and sample diversity.
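To illustrate the shape of such a guarantee (a schematic decomposition, not the paper's exact theorem), write $E$, $D$ for the encoder and decoder, $q$ for the data distribution, and $\hat p_z$ for the latent distribution produced by the learned flow. If $D$ is $L$-Lipschitz, the triangle inequality and the Lipschitz pushforward property give

$$
W_2\big(q,\; D_\# \hat p_z\big) \;\le\; \underbrace{W_2\big(q,\; D_\# E_\# q\big)}_{\text{autoencoder reconstruction error}} \;+\; L \cdot \underbrace{W_2\big(E_\# q,\; \hat p_z\big)}_{\text{latent generation error}},
$$

so controlling the latent flow matching loss (which governs the second term) and the autoencoder quality (first term) controls how far generated samples are from the data distribution.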
A crucial insight is that, with strong latent models (e.g., diffusion-based autoencoders), the performance difference between latent and pixel-space CFM is minimal; and for large datasets or high resolution, only the latent variant is computationally feasible.
6. Empirical Results and Theoretical Analysis
Empirical benchmarks confirm:
- Image generation: On datasets such as CelebA-HQ, FFHQ, LSUN, and ImageNet, latent CFM achieves better or competitive FID and recall compared to leading latent diffusion and flow-matching architectures.
- Data-efficiency: Comparable FID is achieved with significantly fewer training iterations and less computational cost relative to pixel-space techniques.
- Conditional tasks: A unified approach supports diverse conditional settings (label, mask, semantic map) without the need for task-specific architectural changes.
- Theoretical guarantees: Explicit Wasserstein-2 bounds derived from the loss control the generated distribution's proximity to the true data distribution.
The analysis also reveals:
- Further improvements are possible by scaling model size (e.g., DiT-L/2 or DiT-XL/2 as velocity field backbones).
- Pretrained diffusion or advanced autoencoders (e.g., Stable Diffusion) can be incorporated to further enhance sample fidelity and visual quality.
7. Extensions, Future Directions, and Availability
Directions for further research include:
- Scaling: Applying larger transformer-based velocity fields and extending to higher output resolutions.
- Text-to-image and multimodal generation: Developing modules for text or video conditioning, leveraging the generality of the conditional velocity field input design.
- Autoencoder advances: Exploiting improvements in latent-space modeling to boost final generation quality.
- Sample quality and ODE solver design: Systematic analysis of the trade-offs between solver steps (NFE), sample quality, and hardware efficiency.
The reference open-source implementation is available at https://github.com/VinAIResearch/LFM.git.
In summary, Latent Conditional Flow Matching enables scalable, efficient, and versatile generative modeling by combining optimal transport-based flow matching with latent variable representations. It achieves strong empirical performance on challenging benchmarks and supports a wide range of unconditional and conditional generation tasks, while providing explicit theoretical foundations for generation quality. This positions CFM and its latent variants as foundational tools in modern deep generative modeling research and applications.