
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution (2412.15213v1)

Published 19 Dec 2024 in cs.CV

Abstract: Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike Diffusion models, they are not constrained for the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.

The paper "Flowing from Words to Pixels: A Framework for Cross-Modality Evolution" (Liu et al., 19 Dec 2024 ) proposes CrossFlow, a novel framework for cross-modal generative tasks based on flow matching. Traditional flow matching and diffusion models typically learn a mapping from Gaussian noise to the target data distribution, incorporating conditioning mechanisms (like cross-attention for text-to-image) when a source modality is involved. CrossFlow introduces a paradigm shift by training flow matching models to directly learn the mapping from the distribution of one modality (the source) to the distribution of another (the target), thereby removing the need for a noise source and explicit conditioning mechanisms.

The core idea is to treat the source modality data itself as the starting point ($z_0$) for the flow matching process that evolves towards the target modality data ($z_1$). This presents two main practical challenges:

  1. Shape Mismatch: Different modalities naturally have different data shapes (e.g., text sequences vs. image grids).
  2. Classifier-Free Guidance (CFG): CFG, crucial for generation quality, relies on comparing conditional and unconditional predictions, which is not directly applicable when the source data itself is the "condition."

CrossFlow addresses these challenges with specific architectural and training designs. To handle the shape mismatch and provide a suitable source distribution, the framework employs a Variational Encoder (VE) for the source modality. For instance, in text-to-image, a Text VE encodes text embeddings into a latent space with the same shape as the image latent space. The paper emphasizes that formulating $z_0$ as a regularized distribution via the VE is essential for flow matching to work effectively in this cross-modal setting. The VE is trained to encode the input $x$ into parameters $(\bar{\mu}_{z_0}, \bar{\sigma}_{z_0}^2)$ of a Gaussian distribution from which $z_0$ is sampled.
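As a rough illustration, such a Text VE could be sketched as follows (PyTorch-style; the module structure, dimensions, and mean-pooling choice are assumptions for illustration, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class TextVariationalEncoder(nn.Module):
    """Encodes variable-length text embeddings into a Gaussian source latent z_0
    whose (flattened) shape matches the target image latent."""
    def __init__(self, text_dim=768, latent_dim=4096, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(text_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Predict mean and log-variance of the source latent distribution.
        self.to_mu = nn.Linear(text_dim, latent_dim)
        self.to_logvar = nn.Linear(text_dim, latent_dim)

    def forward(self, text_emb):                     # text_emb: (B, T, text_dim)
        h = self.backbone(text_emb).mean(dim=1)      # pool over tokens -> (B, text_dim)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z0 = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z0, mu, logvar
```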

For enabling CFG without explicit conditioning, CrossFlow introduces a binary indicator ($1_c$). This indicator, set to 1 for conditional generation and 0 for a form of "unconditional" generation (mapping $z_0$ to a general target distribution sample $z_1^{uc}$), is used during training. The model $v_\theta$ is trained to predict the velocity field conditioned on this indicator, i.e., $v_\theta(z_t, t, 1_c)$. During inference, CFG can then be applied by extrapolating between the predictions made with $1_c=1$ and $1_c=0$.
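The sketch below shows how the indicator could be used for guidance at sampling time with a simple Euler solver (the extrapolation follows standard CFG; the function signature, guidance scale, and step count are illustrative assumptions):

```python
import torch

def sample_with_indicator_cfg(v_theta, z0, guidance_scale=5.0, num_steps=50):
    """Integrate dz/dt = v_theta(z, t, 1_c) from the source latent z0 (t=0)
    toward the target latent z1 (t=1), extrapolating between the indicator
    settings 1_c = 1 (conditional) and 1_c = 0 ("unconditional")."""
    z = z0.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((z.shape[0],), i * dt, device=z.device)
        v_cond = v_theta(z, t, torch.ones_like(t))     # prediction with 1_c = 1
        v_uncond = v_theta(z, t, torch.zeros_like(t))  # prediction with 1_c = 0
        v = v_uncond + guidance_scale * (v_cond - v_uncond)  # CFG extrapolation
        z = z + v * dt                                 # Euler step toward z1
    return z
```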

The training objective for CrossFlow is a joint optimization of the flow matching model and the VE. It combines the standard flow matching MSE loss ($L_{FM}$) between the predicted velocity $v_\theta(z_t, t)$ and the ground-truth velocity $\hat{v} = z_1 - (1-\sigma_{min})z_0$, with the VE training losses: an encoding loss ($L_{Enc}$) and a KL-divergence loss ($L_{KL}$) regularizing the latent space towards a prior (e.g., $\mathcal{N}(0, 1)$).

$$L = L_{FM} + L_{Enc} + \lambda L_{KL}$$
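A minimal training-step sketch of this joint objective might look like the following (assumptions: the $\sigma_{min}$ value, the $\lambda$ weight, and the shape handling are placeholders; the indicator-based CFG pairing is omitted for brevity, and the encoding loss is discussed next):

```python
import torch
import torch.nn.functional as F

def crossflow_training_step(v_theta, text_ve, text_emb, z1, z_hat, enc_loss_fn,
                            sigma_min=1e-5, lam=1e-3):
    """One joint step: flow matching loss + VE encoding loss + KL regularization."""
    z0, mu, logvar = text_ve(text_emb)                # source latent from the VE
    z0 = z0.view_as(z1)                               # reshape to the target latent shape

    t = torch.rand(z1.shape[0], device=z1.device)     # t ~ U(0, 1)
    t_ = t.view(-1, *([1] * (z1.dim() - 1)))          # broadcast t over latent dims
    zt = (1 - (1 - sigma_min) * t_) * z0 + t_ * z1    # interpolant between z0 and z1
    v_target = z1 - (1 - sigma_min) * z0              # ground-truth velocity v_hat

    # L_FM: MSE on the velocity (indicator fixed to 1 here; CFG pairing omitted).
    loss_fm = F.mse_loss(v_theta(zt, t, torch.ones_like(t)), v_target)
    # L_Enc: e.g. a contrastive loss against paired target features (see below).
    loss_enc = enc_loss_fn(z0, z_hat)
    # L_KL: regularize the VE latent toward N(0, 1).
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return loss_fm + loss_enc + lam * loss_kl
```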

For the encoding loss $L_{Enc}$, instead of a traditional VAE reconstruction loss on the input modality, the authors found that a contrastive loss between the source latent ($z_0$) and a target feature representation ($\hat{z}$) works significantly better for capturing semantic information. They specifically use an image-text contrastive loss (based on CLIP) for text-to-image, where $\hat{z}$ is derived from the paired image. Jointly training the VE and the flow matching model from scratch, or training the VE first and then jointly finetuning with flow matching, was found to be more effective than separate training.
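A CLIP-style symmetric contrastive loss of this kind could be sketched as below (a generic InfoNCE formulation rather than the paper's exact variant; the temperature and the assumption that $z_0$ and $\hat{z}$ share a flattened dimension are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_encoding_loss(z0, z_hat, temperature=0.07):
    """Symmetric InfoNCE between source latents z0 and paired target features z_hat:
    matched (z0_i, z_hat_i) pairs are pulled together, mismatched pairs pushed apart."""
    z0 = F.normalize(z0.flatten(1), dim=-1)         # (B, D)
    z_hat = F.normalize(z_hat.flatten(1), dim=-1)   # (B, D), assumed same D as z0
    logits = z0 @ z_hat.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(z0.shape[0], device=z0.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```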

The flow matching model $v_\theta$ itself can be a standard architecture such as a vanilla Transformer (e.g., a DiT variant adapted for flow matching), crucially without needing cross-attention layers for text-to-image. The Text VE for text-to-image can also be built from Transformer blocks followed by a projection layer. CrossFlow can operate on latent spaces (e.g., using a pre-trained VAE for images) for efficiency, or directly on pixel space.
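To illustrate why no cross-attention is required, a velocity model can be as simple as a stack of self-attention blocks, with $t$ and the CFG indicator injected as additive embeddings (a hedged sketch only; the actual model is a DiT variant with its own conditioning details such as adaLN):

```python
import torch
import torch.nn as nn

class VanillaFlowTransformer(nn.Module):
    """Predicts the velocity field from z_t alone; t and the CFG indicator 1_c
    enter only as additive embeddings, so no cross-attention layers are needed."""
    def __init__(self, latent_dim=64, width=512, n_layers=8, n_heads=8):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, width)
        self.t_embed = nn.Sequential(nn.Linear(1, width), nn.SiLU(), nn.Linear(width, width))
        self.c_embed = nn.Embedding(2, width)          # indicator 1_c in {0, 1}
        layer = nn.TransformerEncoderLayer(width, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.proj_out = nn.Linear(width, latent_dim)

    def forward(self, zt, t, indicator):               # zt: (B, n_tokens, latent_dim)
        cond = self.t_embed(t[:, None]) + self.c_embed(indicator.long())
        h = self.proj_in(zt) + cond[:, None, :]        # broadcast conditioning over tokens
        return self.proj_out(self.blocks(h))           # predicted velocity, same shape as zt
```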

Practical Applications and Results:

The paper demonstrates CrossFlow's effectiveness across various cross-modal and intra-modal tasks:

  1. Text-to-Image Generation:
    • Compared to a standard flow matching baseline using a noise source and text cross-attention, CrossFlow (using a vanilla transformer without cross-attention) achieves comparable or slightly better performance (FID, CLIP score) with similar model sizes and training budgets.
    • Scalability: CrossFlow shows better scaling characteristics with increasing model size and training steps compared to the standard baseline, suggesting stronger potential for larger models and more training.
    • State-of-the-art: Achieves performance competitive with recent SOTA text-to-image models (FID ~9.6 on COCO, GenEval ~0.55), often with significantly less training compute.
    • Latent Arithmetic: A unique benefit of encoding the source modality into a regularized semantic space is the ability to perform meaningful arithmetic operations (such as interpolation and vector math) directly in the source latent space ($z_0$), which translates into semantically coherent edits in the generated images (see the interpolation sketch after this list).
  2. Image Captioning (Image to Text): CrossFlow is trained to map image latents to text latents. It achieves comparable performance to non-autoregressive SOTA methods on the COCO dataset (Karpathy split).
  3. Monocular Depth Estimation (Image to Depth): CrossFlow is trained to map image pixels to depth pixels. It achieves performance comparable to SOTA methods on KITTI and NYUv2 datasets, using a unified framework without task-specific designs. It also shows competitive results in zero-shot depth estimation across various real-world datasets.
  4. Image Super-Resolution (Low-Res to High-Res Image): CrossFlow maps an upsampled low-resolution image directly to a high-resolution image. It outperforms standard flow matching and diffusion (SR3) baselines for $64\times 64 \rightarrow 256\times 256$ super-resolution on ImageNet, demonstrating its effectiveness even for intra-modal tasks involving distribution evolution.
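Because $z_0$ lives in a regularized semantic space, the latent arithmetic mentioned above reduces to simple vector operations before the flow is run. The sketch below reuses the hypothetical helpers from the earlier snippets and is illustrative only:

```python
def interpolate_prompts(text_ve, v_theta, emb_a, emb_b, alpha=0.5):
    """Linearly interpolate the source latents of two prompts and run the same flow
    from the blended latent, yielding a semantically mixed output."""
    z0_a, _, _ = text_ve(emb_a)
    z0_b, _, _ = text_ve(emb_b)
    z0_mix = (1 - alpha) * z0_a + alpha * z0_b      # arithmetic directly in z0 space
    # In practice z0_mix would be reshaped to the image-latent grid before sampling.
    return sample_with_indicator_cfg(v_theta, z0_mix)
```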

Implementation Considerations:

  • The Variational Encoder design and training objective (especially the use of contrastive loss) are critical for performance, significantly impacting the quality of the learned source latent space.
  • Joint training of the VE and flow matching model is preferable for best results and convergence, although initializing with a pre-trained VE and then jointly finetuning can speed up initial convergence.
  • The CFG indicator mechanism is shown to be a practical way to apply CFG in this direct mapping setup, outperforming alternative guidance methods like Autoguidance in their experiments.
  • CrossFlow is compatible with different language models for text processing, and performance generally improves with more powerful text encoders, although it works well even with a smaller model like CLIP.
  • While CrossFlow requires a VE for the source modality, it simplifies the flow matching model architecture by potentially removing the need for complex cross-attention mechanisms, especially when using vanilla transformers.

In summary, CrossFlow presents a simple, general, and effective framework for cross-modal and intra-modal generation by directly learning the transport between modality distributions using flow matching. By incorporating a Variational Encoder and an indicator-based CFG mechanism, it overcomes key challenges and demonstrates competitive performance across diverse tasks, while offering novel capabilities like latent arithmetic and showing promising scalability.

Authors (5)
  1. Qihao Liu (23 papers)
  2. Xi Yin (88 papers)
  3. Alan Yuille (294 papers)
  4. Andrew Brown (31 papers)
  5. Mannat Singh (13 papers)