
Conditional Flow-Matching in Latent Space

Updated 23 June 2025

Conditional flow-matching is a principled generative modeling framework that learns to map a simple base distribution to complex target distributions conditioned on auxiliary information, such as class labels, masks, or semantic layouts. The "Flow Matching in Latent Space" framework extends flow matching by operating in the latent space of pretrained autoencoders and is distinguished by its ability to flexibly and efficiently incorporate a wide variety of conditioning types, enabling practical conditional generation and manipulation tasks at high resolution and with reduced computational overhead.

1. Latent Space Flow Matching: Foundations and Methodology

Traditional flow-matching approaches operate in pixel space, incurring prohibitive computational costs for high-resolution image synthesis. The latent-space flow-matching framework instead trains a conditional flow in the latent domain learned by a pretrained autoencoder (typically a VAE), leveraging the fact that autoencoder latents encode core semantic information while reducing problem dimensionality.

Given a data sample $\mathbf{x}_0 \sim p_0$, encoding it with a pretrained encoder $\mathcal{E}$ yields $\mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0)$ in latent space. The objective is to learn a vector field $v_\theta$ such that, for latent codes sampled from a Gaussian noise prior $\mathbf{z}_1 \sim \mathcal{N}(0, \mathbf{I})$, an ODE-based flow transports $\mathbf{z}_1$ to $\mathbf{z}_0$ across $t \in [0, 1]$. The flow is linear,

$$\mathbf{z}_t = (1-t)\,\mathbf{z}_0 + t\,\mathbf{z}_1,$$

and the target velocity field at $\mathbf{z}_t$ is

$$v_t = \mathbf{z}_1 - \mathbf{z}_0.$$

Training minimizes

$$\hat{\theta} = \arg\min_\theta\, \mathbb{E}_{t, \mathbf{z}_t}\left[\left\| \mathbf{z}_1 - \mathbf{z}_0 - v_\theta(\mathbf{z}_t, t) \right\|_2^2\right].$$
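As a concrete illustration, the following PyTorch-style sketch implements one training step of this objective. The names `encoder` and `v_theta` are placeholders for the pretrained autoencoder's encoder and the velocity network; this is a minimal sketch of the loss, not the paper's actual implementation.

```python
import torch

def fm_training_step(v_theta, encoder, x0, optimizer):
    """One latent flow-matching training step (illustrative sketch)."""
    with torch.no_grad():
        z0 = encoder(x0)                          # data latent, shape (B, C, H, W)
    z1 = torch.randn_like(z0)                     # Gaussian noise prior z1 ~ N(0, I)
    t = torch.rand(z0.size(0), device=z0.device)  # t ~ U[0, 1], one per sample
    t_ = t.view(-1, 1, 1, 1)
    zt = (1 - t_) * z0 + t_ * z1                  # linear interpolation path
    target = z1 - z0                              # constant target velocity
    loss = ((v_theta(zt, t) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```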

At inference, noise $\mathbf{z}_1$ is transported along this flow (backward ODE integration from $t = 1$ to $t = 0$) to a data-like latent, which is then decoded by the pretrained decoder.
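A minimal Euler-integration sketch of this sampling procedure, again assuming hypothetical `v_theta` and `decoder` modules:

```python
import torch

@torch.no_grad()
def fm_sample(v_theta, decoder, shape, steps=100, device="cpu"):
    """Euler integration from noise (t=1) back to a data-like latent (t=0)."""
    z = torch.randn(shape, device=device)         # start from z1 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        z = z - dt * v_theta(z, t)                # step against dz/dt = v(z, t)
    return decoder(z)                             # decode latent to pixel space
```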

2. Conditioning Mechanisms and Applications

The framework supports general conditioning by augmenting the velocity field’s input with the condition variable(s), enabling several classes of conditional generative modeling tasks:

  • Label-Conditioned Generation: The class label $\mathbf{c}$ is appended to the network input, $v_\theta(\mathbf{z}_t, \mathbf{c}, t)$, and classifier-free guidance is introduced. The model is trained on both conditioned ($\mathbf{c}$ present) and unconditioned ($\mathbf{c}$ randomly omitted) samples; at generation, the combined output

$$\tilde{v}_\theta(\mathbf{z}_t, \mathbf{c}, t) = \gamma\, v_\theta(\mathbf{z}_t, \mathbf{c}, t) + (1-\gamma)\, v_\theta(\mathbf{z}_t, \emptyset, t)$$

allows sampling to trade off quality against diversity, without requiring an external classifier (see the sketch following this list).

  • Image Inpainting: The condition is a mask plus the latent code of the masked input image. The velocity network receives

$$v_\theta(\operatorname{concat}[\mathbf{z}_t, \mathbf{z}_m, \bar{\mathbf{m}}], t),$$

where $\mathbf{z}_m$ is the latent of the masked image and $\bar{\mathbf{m}}$ is the mask, providing the information needed to synthesize missing regions.

  • Semantic-to-Image Generation: Conditioning on a one-hot semantic mask $m$, which is projected and concatenated with the latent: $v_\theta(\operatorname{concat}[\mathbf{z}_t, m_c], t)$, where $m_c$ is the projected mask. The mask projection network is trained jointly with the flow.
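The two conditioning interfaces above can be summarized in a short sketch. The function names, the use of `None` as the null condition for the unconditional branch, and the channel-wise concatenation layout are illustrative conventions, not the paper's API:

```python
import torch

def guided_velocity(v_theta, zt, c, t, gamma=2.0):
    """Classifier-free guidance: blend conditional and unconditional velocities."""
    v_cond = v_theta(zt, c, t)         # conditioned prediction
    v_uncond = v_theta(zt, None, t)    # unconditioned prediction (null condition)
    return gamma * v_cond + (1.0 - gamma) * v_uncond

def inpainting_input(zt, z_masked, mask):
    """Channel-wise concatenation of flow state, masked-image latent, and mask."""
    return torch.cat([zt, z_masked, mask], dim=1)   # shape (B, C1 + C2 + 1, H, W)
```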

This architecture is the first latent flow-matching model to flexibly support label, mask, and structural conditions within a single framework, directly enabling high-fidelity conditional generation, inpainting, semantic layout synthesis, and hybrid tasks.

3. Computational Advantages and Scalability

Operating in latent space confers significant practical benefits:

  • Reduced Dimensionality: For high-resolution images (e.g., $256 \times 256$), latent codes (e.g., $32 \times 32 \times 4$) drastically shrink the network and memory footprint.
  • Training and Sampling Speed: Fewer network parameters and hidden units, simpler ODE dynamics, and fewer function evaluations (NFE) per sample. For instance, on CelebA-HQ 256,

$$\text{Latent FM (DiT-L/2):}\quad \text{NFE} = 89, \quad \text{FID} = 5.26, \quad \text{Time} = 1.70\ \text{s},$$

compared to pixel-space FM ($\text{NFE} = 128$, $\text{FID} = 7.34$) and LDM ($\text{NFE} = 50$, $\text{FID} = 5.11$, $\text{Time} = 2.90\ \text{s}$).

  • Scalability: The reduced overhead enables training on commodity hardware at larger resolutions ($512 \times 512$ and above), a challenge for pixel-space flows.
  • ODE Solver Robustness: Performance remains stable across adaptive, Euler, and Heun integrators, further streamlining deployment (a Heun-step sketch follows this list).
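For illustration, a Heun (second-order) integrator for the backward flow might look like the following sketch, assuming the same hypothetical `v_theta` as above; each step costs two function evaluations:

```python
import torch

@torch.no_grad()
def heun_sample(v_theta, z1, steps=50):
    """Heun integration from t=1 to t=0: Euler predictor, trapezoidal corrector."""
    z, dt = z1, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        tb = torch.full((z.size(0),), t, device=z.device)
        v1 = v_theta(z, tb)
        z_pred = z - dt * v1                        # Euler predictor step
        tb_next = torch.full((z.size(0),), t - dt, device=z.device)
        v2 = v_theta(z_pred, tb_next)
        z = z - dt * 0.5 * (v1 + v2)                # average the two slopes
    return z
```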

4. Theoretical Guarantees and Loss Analysis

The paper provides a theoretical upper bound on the Wasserstein-2 distance between the decoded latent-flow distribution and the true data distribution:

$$\mathcal{W}_2^2(p_0, \hat{p}_0) \leq \|\Delta_{f_\phi, g_\tau}(\mathbf{x}_0)\|^2 + L_{g_\tau}^2\, e^{1 + 2\hat{L}} \int_0^1 \int_{\mathbb{R}^{d/h}} \|v(\mathbf{z}_t, t) - \hat{v}(\hat{\mathbf{z}}_t, t)\|^2\, q_t^\phi\, d\mathbf{z}\, dt.$$

Here, $\Delta_{f_\phi, g_\tau}$ quantifies the autoencoder reconstruction error, and the second term is controlled by the flow-matching objective. This result formally guarantees that better autoencoder backbones and a lower flow-matching loss together translate to closer data-distribution matching in the Wasserstein metric, providing justification for the empirical efficacy of the method.

5. Empirical Evaluation and Performance Benchmarks

The latent flow-matching framework demonstrates strong performance across unconditional and conditional tasks:

  • Unconditional Generation: Attains FID and recall competitive with or superior to leading diffusion models in the latent domain.
  • Conditional Generation (ImageNet, etc.): With classifier-free guidance, achieves FID as low as 4.46 (DiT-B/2), outperforming latent diffusion at matched model size.
  • Image Inpainting: FID of 4.09, approaching state-of-the-art (LaMa 3.98, MAT 2.94) even with a basic latent concatenation approach.
  • Semantic-to-Image: FID of 26.3, surpassing several domain-specific baselines and competitive with SPADE.
  • Ablation Studies: Show stable performance and minimal sensitivity to the choice of ODE solver across conditional tasks.

6. Outlook and Future Directions

Proposed directions include:

  • Stronger Backbone Autoencoders: The efficacy of latent flow matching depends on autoencoder quality; improvements directly tighten the Wasserstein bound.
  • Scaling to Larger and Multimodal Domains: Extension to higher resolutions, video, and multimodal generation (e.g., text-to-image) is plausible.
  • Richer Conditioning and Guidance: Expanding conditioning to encompass text, multimodal data, and advanced guidance schemes.
  • Theoretical and Algorithmic Advances: Tighter theoretical analysis, advanced ODE solvers, trajectory regularization, integration with consistency models or adversarial objectives.
  • Mode Coverage and Coupling: Deeper study of how the latent-space flow influences global data mode coverage and coupling constructions.

| Aspect | Latent Flow Matching | Pixel-Space Flow Matching |
| --- | --- | --- |
| Domain | Latent (VAE) representations | Raw pixel space |
| Computational efficiency | High (fast) | Low (slow) |
| Scalability | Up to 512×512 images | Limited |
| Conditioning | Class, mask, semantic, etc. | Not supported |
| Classifier-free guidance | Yes | Not supported |
| Theoretical guarantees | Wasserstein bound in latent space | No latent-space theory |
| Empirical quality | SOTA-competitive FID/recall | Lower quality, slower |

Conditional flow matching in latent space, as proposed in "Flow Matching in Latent Space," offers a flexible, theoretically justified, and practically efficient route for high-quality, conditionally controllable generative modeling. By leveraging autoencoder latents, streamlined ODE-based flows, and modular conditioning, it establishes a foundation for scalable synthesis and manipulation of images and potentially other high-dimensional signals in diverse applications.