
Conditional Flow-Matching in Latent Space

Updated 23 June 2025

Conditional flow-matching is a principled generative modeling framework that learns to map a simple base distribution to complex target distributions conditioned on auxiliary information, such as class labels, masks, or semantic layouts. The "Flow Matching in Latent Space" framework extends flow matching by operating in the latent space of pretrained autoencoders and is distinguished by its ability to flexibly and efficiently incorporate a wide variety of conditioning types, enabling practical conditional generation and manipulation tasks at high resolution and with reduced computational overhead.

1. Latent Space Flow Matching: Foundations and Methodology

Traditional flow-matching approaches operate in pixel space, incurring prohibitive computational costs for high-resolution image synthesis. The latent-space flow-matching framework instead trains a conditional flow in the latent domain learned by a pretrained autoencoder (typically a VAE), leveraging the fact that autoencoder latents encode core semantic information while reducing problem dimensionality.

Given a data sample $\mathbf{x}_0 \sim p_0$, encoding it with a pretrained encoder $\mathcal{E}$ yields $\mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0)$ in latent space. The objective is to learn a vector field $v_\theta$ such that, for latent codes sampled from a Gaussian noise prior $\mathbf{z}_1 \sim \mathcal{N}(0, \mathbf{I})$, an ODE-based flow transports $\mathbf{z}_1$ to $\mathbf{z}_0$ across $t \in [0, 1]$. The flow is linear,

$$\mathbf{z}_t = (1-t)\,\mathbf{z}_0 + t\,\mathbf{z}_1,$$

and the target velocity field at $\mathbf{z}_t$ is

$$v_t = \mathbf{z}_1 - \mathbf{z}_0.$$

Training minimizes

$$\hat{\theta} = \arg\min_\theta\, \mathbb{E}_{t, \mathbf{z}_t}\left[\left\| \mathbf{z}_1 - \mathbf{z}_0 - v_\theta(\mathbf{z}_t, t) \right\|_2^2\right].$$
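As a concrete illustration, the following PyTorch-style sketch implements one training step of this objective. The names `encoder` and `v_theta` are placeholders for the pretrained autoencoder's encoder and the velocity network; this is a minimal sketch of the loss, not the paper's actual implementation.

```python
import torch

def fm_training_step(v_theta, encoder, x0, optimizer):
    """One latent flow-matching training step (illustrative sketch)."""
    with torch.no_grad():
        z0 = encoder(x0)                          # data latent, shape (B, C, H, W)
    z1 = torch.randn_like(z0)                     # Gaussian noise prior z1 ~ N(0, I)
    t = torch.rand(z0.size(0), device=z0.device)  # t ~ U[0, 1], one per sample
    t_ = t.view(-1, 1, 1, 1)
    zt = (1 - t_) * z0 + t_ * z1                  # linear interpolation path
    target = z1 - z0                              # constant target velocity
    loss = ((v_theta(zt, t) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```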

At inference, noise $\mathbf{z}_1$ is transported along this flow (backward ODE integration from $t = 1$ to $t = 0$) to a data-like latent, which is then decoded by the pretrained decoder.
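A minimal Euler-integration sketch of this sampling procedure, again assuming hypothetical `v_theta` and `decoder` modules:

```python
import torch

@torch.no_grad()
def fm_sample(v_theta, decoder, shape, steps=100, device="cpu"):
    """Euler integration from noise (t=1) back to a data-like latent (t=0)."""
    z = torch.randn(shape, device=device)         # start from z1 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        z = z - dt * v_theta(z, t)                # step against dz/dt = v(z, t)
    return decoder(z)                             # decode latent to pixel space
```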

2. Conditioning Mechanisms and Applications

The framework supports general conditioning by augmenting the velocity field’s input with the condition variable(s), enabling several classes of conditional generative modeling tasks:

  • Label-Conditioned Generation: The class label $\mathbf{c}$ is appended to the network input, $v_\theta(\mathbf{z}_t, \mathbf{c}, t)$, and classifier-free guidance is introduced. The model is trained on both conditioned ($\mathbf{c}$ present) and unconditioned ($\mathbf{c}$ randomly omitted) samples; at generation, the combined output

$$\tilde{v}_\theta(\mathbf{z}_t, \mathbf{c}, t) = \gamma\, v_\theta(\mathbf{z}_t, \mathbf{c}, t) + (1-\gamma)\, v_\theta(\mathbf{z}_t, \emptyset, t)$$

allows sampling to trade off quality against diversity, without requiring an external classifier (see the sketch following this list).

  • Image Inpainting: The condition is a mask plus the latent code of the masked input image. The velocity network receives

$$v_\theta(\operatorname{concat}[\mathbf{z}_t, \mathbf{z}_m, \bar{\mathbf{m}}], t),$$

where $\mathbf{z}_m$ is the latent of the masked image and $\bar{\mathbf{m}}$ is the mask, providing the information needed to synthesize missing regions.

  • Semantic-to-Image Generation: Conditioning on a one-hot semantic mask $m$, which is projected and concatenated with the latent: $v_\theta(\operatorname{concat}[\mathbf{z}_t, m_c], t)$, where $m_c$ is the projected mask. The mask projection network is trained jointly with the flow.
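The two conditioning interfaces above can be summarized in a short sketch. The function names, the use of `None` as the null condition for the unconditional branch, and the channel-wise concatenation layout are illustrative conventions, not the paper's API:

```python
import torch

def guided_velocity(v_theta, zt, c, t, gamma=2.0):
    """Classifier-free guidance: blend conditional and unconditional velocities."""
    v_cond = v_theta(zt, c, t)         # conditioned prediction
    v_uncond = v_theta(zt, None, t)    # unconditioned prediction (null condition)
    return gamma * v_cond + (1.0 - gamma) * v_uncond

def inpainting_input(zt, z_masked, mask):
    """Channel-wise concatenation of flow state, masked-image latent, and mask."""
    return torch.cat([zt, z_masked, mask], dim=1)   # shape (B, C1 + C2 + 1, H, W)
```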

This architecture is the first latent flow-matching model to flexibly support label, mask, and structural conditions within a single framework, directly enabling high-fidelity conditional generation, inpainting, semantic layout synthesis, and hybrid tasks.

3. Computational Advantages and Scalability

Operating in latent space confers significant practical benefits:

  • Reduced Dimensionality: For high-resolution images (e.g., $256 \times 256$), latent codes (e.g., $32 \times 32 \times 4$) drastically shrink the network and memory footprint.
  • Training and Sampling Speed: Fewer network parameters and hidden units, simpler ODE dynamics, and fewer function evaluations (NFE) per sample. For instance, on CelebA-HQ 256,

$$\text{Latent FM (DiT-L/2):}\quad \text{NFE} = 89, \quad \text{FID} = 5.26, \quad \text{Time} = 1.70\ \text{s},$$

compared to pixel-space FM ($\text{NFE} = 128$, $\text{FID} = 7.34$) and LDM ($\text{NFE} = 50$, $\text{FID} = 5.11$, $\text{Time} = 2.90\ \text{s}$).

  • Scalability: The reduced overhead enables training on commodity hardware at larger resolutions ($512 \times 512$ and above), a challenge for pixel-space flows.
  • ODE Solver Robustness: Performance remains stable across adaptive, Euler, and Heun integrators, further streamlining deployment (a Heun-step sketch follows this list).
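For illustration, a Heun (second-order) integrator for the backward flow might look like the following sketch, assuming the same hypothetical `v_theta` as above; each step costs two function evaluations:

```python
import torch

@torch.no_grad()
def heun_sample(v_theta, z1, steps=50):
    """Heun integration from t=1 to t=0: Euler predictor, trapezoidal corrector."""
    z, dt = z1, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        tb = torch.full((z.size(0),), t, device=z.device)
        v1 = v_theta(z, tb)
        z_pred = z - dt * v1                        # Euler predictor step
        tb_next = torch.full((z.size(0),), t - dt, device=z.device)
        v2 = v_theta(z_pred, tb_next)
        z = z - dt * 0.5 * (v1 + v2)                # average the two slopes
    return z
```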

4. Theoretical Guarantees and Loss Analysis

The paper provides a theoretical upper bound on the Wasserstein-2 distance between the decoded latent-flow distribution and the true data distribution:

$$\mathcal{W}_2^2(p_0, \hat{p}_0) \leq \|\Delta_{f_\phi, g_\tau}(\mathbf{x}_0)\|^2 + L_{g_\tau}^2\, e^{1 + 2\hat{L}} \int_0^1 \int_{\mathbb{R}^{d/h}} \|v(\mathbf{z}_t, t) - \hat{v}(\hat{\mathbf{z}}_t, t)\|^2\, q_t^\phi\, d\mathbf{z}\, dt.$$

Here, $\Delta_{f_\phi, g_\tau}$ quantifies the autoencoder reconstruction error, and the second term is controlled by the flow-matching objective. This result formally guarantees that better autoencoder backbones and a lower flow-matching loss together translate to closer data-distribution matching in the Wasserstein metric, providing justification for the empirical efficacy of the method.

5. Empirical Evaluation and Performance Benchmarks

The latent flow-matching framework demonstrates strong performance across unconditional and conditional tasks:

  • Unconditional Generation: Attains FID and recall competitive with or superior to leading diffusion models in the latent domain.
  • Conditional Generation (ImageNet, etc.): With classifier-free guidance, achieves FID as low as 4.46 (DiT-B/2), outperforming latent diffusion at matched model size.
  • Image Inpainting: FID of 4.09, approaching state-of-the-art (LaMa 3.98, MAT 2.94) even with a basic latent concatenation approach.
  • Semantic-to-Image: FID of 26.3, surpassing several domain-specific baselines and competitive with SPADE.
  • Ablation Studies: Show stable performance and minimal sensitivity to the choice of ODE solver across conditional tasks.

6. Outlook and Future Directions

Proposed directions include:

  • Stronger Backbone Autoencoders: The efficacy of latent flow matching depends on autoencoder quality; improvements directly tighten the Wasserstein bound.
  • Scaling to Larger and Multimodal Domains: Extension to higher resolutions, video, and multimodal generation (e.g., text-to-image) is plausible.
  • Richer Conditioning and Guidance: Expanding conditioning to encompass text, multimodal data, and advanced guidance schemes.
  • Theoretical and Algorithmic Advances: Tighter theoretical analysis, advanced ODE solvers, trajectory regularization, integration with consistency models or adversarial objectives.
  • Mode Coverage and Coupling: Deeper study of how the latent-space flow influences global data mode coverage and coupling constructions.

| Aspect | Latent Flow Matching | Pixel-Space Flow Matching |
| --- | --- | --- |
| Domain | Latent (VAE) representations | Raw pixel space |
| Computational efficiency | High (fast) | Low (slow) |
| Scalability | Up to 512×512 images | Limited |
| Conditioning | Class, mask, semantic, etc. | Not supported |
| Classifier-free guidance | Yes | Not supported |
| Theoretical guarantees | Wasserstein bound in latent space | No latent-space theory |
| Empirical quality | SOTA-competitive FID/recall | Lower quality, slower |

Conditional flow matching in latent space, as proposed in "Flow Matching in Latent Space," offers a flexible, theoretically justified, and practically efficient route for high-quality, conditionally controllable generative modeling. By leveraging autoencoder latents, streamlined ODE-based flows, and modular conditioning, it establishes a foundation for scalable synthesis and manipulation of images and potentially other high-dimensional signals in diverse applications.