The paper "Flowing from Words to Pixels: A Framework for Cross-Modality Evolution" (Liu et al., 19 Dec 2024) proposes CrossFlow, a novel framework for cross-modal generative tasks based on flow matching. Traditional flow matching and diffusion models typically learn a mapping from Gaussian noise to the target data distribution, incorporating conditioning mechanisms (like cross-attention for text-to-image) when a source modality is involved. CrossFlow introduces a paradigm shift by training flow matching models to directly learn the mapping from the distribution of one modality (the source) to the distribution of another (the target), thereby removing the need for a noise source and explicit conditioning mechanisms.
The core idea is to treat the source modality data itself as the starting point ($x_0$) for the flow matching process that evolves towards the target modality data ($x_1$); a minimal sketch of this objective follows the list below. This presents two main practical challenges:
- Shape Mismatch: Different modalities naturally have different data shapes (e.g., text sequences vs. image grids).
- Classifier-Free Guidance (CFG): CFG, crucial for generation quality, relies on comparing conditional and unconditional predictions, which is not directly applicable when the source data itself is the "condition."
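The objective itself is the standard flow matching regression, only with the Gaussian source swapped for a source-modality latent. Here is a minimal PyTorch sketch under that reading; the model signature, the 4-D latent shapes, and the linear interpolation path are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, x1):
    """Cross-modal flow matching loss (sketch).

    x0: source-modality latents (e.g., text latents from the VE), shape (B, C, H, W)
    x1: target-modality latents (e.g., image latents), same shape as x0
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                            # point on the linear path
    v_target = x1 - x0                                    # ground-truth velocity
    v_pred = model(xt, t.flatten())                       # model predicts the velocity field
    return F.mse_loss(v_pred, v_target)
```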
CrossFlow addresses these challenges with specific architectural and training designs. To handle the shape mismatch and provide a suitable source distribution, the framework employs a Variational Encoder (VE) for the source modality. For instance, in text-to-image, a Text VE encodes text embeddings into a latent space with the same shape as the image latent space. The paper emphasizes that formulating $x_0$ as a regularized distribution via the VE is essential for flow matching to work effectively in this cross-modal setting. The VE is trained to encode the input into the parameters ($\mu$, $\sigma$) of a Gaussian distribution from which $x_0$ is sampled.
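A minimal sketch of such a Text VE is shown below, assuming a pooled text embedding as input and simple linear heads producing $\mu$ and $\log\sigma^2$; the paper builds the VE from Transformer blocks plus a projection layer, so treat the architecture and latent shape here as placeholders.

```python
import torch
import torch.nn as nn

class TextVE(nn.Module):
    """Hypothetical Text Variational Encoder: maps a pooled text embedding to
    a Gaussian latent shaped like the image latent (reparameterization trick)."""
    def __init__(self, text_dim=768, latent_shape=(4, 32, 32)):
        super().__init__()
        self.latent_shape = latent_shape
        n = latent_shape[0] * latent_shape[1] * latent_shape[2]
        self.to_mu = nn.Linear(text_dim, n)
        self.to_logvar = nn.Linear(text_dim, n)

    def forward(self, text_emb):
        mu, logvar = self.to_mu(text_emb), self.to_logvar(text_emb)
        x0 = mu + (0.5 * logvar).exp() * torch.randn_like(mu)        # x0 ~ N(mu, sigma^2)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()  # KL to N(0, I)
        return x0.view(-1, *self.latent_shape), mu, logvar, kl
```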
For enabling CFG without explicit conditioning, CrossFlow introduces a binary indicator ($b$). This indicator, set to $b=1$ for conditional generation and $b=0$ for a form of "unconditional" generation (mapping $x_0$ to a generic sample from the target distribution), is used during training. The model is trained to predict the velocity field conditioned on this indicator, i.e., $v_\theta(x_t, t, b)$. During inference, CFG can then be applied by extrapolating between the predictions made with $b=1$ and $b=0$.
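At sampling time this amounts to integrating the learned ODE from the source latent while extrapolating the two indicator branches. A hedged Euler-integration sketch, assuming the model takes the indicator as a third argument:

```python
import torch

@torch.no_grad()
def sample_with_cfg(model, x0, steps=50, guidance=2.0):
    """Euler integration of the learned flow with indicator-based CFG (sketch).

    model(x, t, b) predicts the velocity; b=1 is the conditional branch,
    b=0 the 'unconditional' branch trained to reach a generic target sample.
    """
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v_cond = model(x, t, torch.ones_like(t))       # b = 1
        v_uncond = model(x, t, torch.zeros_like(t))    # b = 0
        v = v_uncond + guidance * (v_cond - v_uncond)  # CFG extrapolation
        x = x + v * dt                                 # Euler step toward t = 1
    return x
```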
The training objective for CrossFlow is a joint optimization of the flow matching model and the VE. It combines the standard flow matching MSE loss ($\mathcal{L}_{FM}$) between the predicted velocity $v_\theta(x_t, t)$ and the ground-truth velocity $v_t = x_1 - x_0$ with the VE training losses: an encoding loss ($\mathcal{L}_{enc}$) and a KL-divergence loss ($\mathcal{L}_{KL}$) regularizing the latent space towards a prior (e.g., $\mathcal{N}(0, I)$).
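Putting the pieces together, one joint step might look like the sketch below, reusing `flow_matching_loss` and the `TextVE` from the earlier sketches; `enc_loss_fn` is the encoding loss discussed next, and the loss weights are illustrative placeholders, not the paper's settings.

```python
def joint_training_step(model, ve, text_emb, x1, image_feat, enc_loss_fn,
                        lambda_enc=1.0, lambda_kl=1e-3):
    """One joint optimization step for the VE and the flow model (sketch):
    L = L_FM + lambda_enc * L_enc + lambda_kl * L_KL."""
    x0, mu, logvar, kl_loss = ve(text_emb)       # sample the regularized source latent
    fm_loss = flow_matching_loss(model, x0, x1)  # flow matching MSE (sketch above)
    enc_loss = enc_loss_fn(x0, image_feat)       # e.g., the contrastive loss below
    return fm_loss + lambda_enc * enc_loss + lambda_kl * kl_loss
```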
For the encoding loss $\mathcal{L}_{enc}$, instead of a traditional VAE reconstruction loss on the input modality, the authors found that a contrastive loss between the source latent ($x_0$) and a target feature representation ($z$) works significantly better for capturing semantic information. They specifically use an image-text contrastive loss (based on CLIP) for text-to-image, where $z$ is derived from the paired image. Jointly training the VE and the flow matching model from scratch, or training the VE first and then jointly finetuning with flow matching, was found to be more effective than training them separately.
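A CLIP-style symmetric contrastive loss over a batch of paired latents and image features might look like the following sketch; the pooling of $x_0$, the temperature, and the assumption that the two feature dimensions already match (otherwise a projection head would be needed) are all illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_enc_loss(x0, image_feat, temperature=0.07):
    """Symmetric InfoNCE between pooled source latents and paired image
    features (CLIP-style), standing in for a VAE reconstruction loss."""
    z_txt = F.normalize(x0.flatten(1), dim=-1)            # pooled source latent
    z_img = F.normalize(image_feat, dim=-1)               # paired image feature z
    logits = z_txt @ z_img.t() / temperature              # pairwise similarities
    labels = torch.arange(x0.shape[0], device=x0.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```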
The flow matching model itself can be a standard architecture like a vanilla Transformer (e.g., a DiT variant adapted for flow matching), crucially without needing cross-attention layers for text-to-image. The Text VE for text-to-image can also be built using Transformer blocks followed by a projection layer. CrossFlow can operate on latent spaces (e.g., using a pre-trained VAE for images) for efficiency or directly on pixel space.
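To make the "no cross-attention" point concrete, here is a deliberately simplified self-attention-only velocity predictor in which time and the CFG indicator enter through an added embedding; real DiT-style blocks use adaLN-style conditioning, and all sizes here are assumptions.

```python
import torch
import torch.nn as nn

class VelocityTransformer(nn.Module):
    """Sketch of a vanilla-transformer velocity predictor with no cross-attention:
    patchified latent tokens pass through self-attention blocks only, and the
    time/indicator conditioning is added as an embedding. Sizes are illustrative."""
    def __init__(self, dim=512, depth=8, heads=8, patch_dim=64):
        super().__init__()
        self.proj_in = nn.Linear(patch_dim, dim)
        self.time_mlp = nn.Sequential(nn.Linear(2, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(dim, patch_dim)

    def forward(self, x_tokens, t, indicator):
        # x_tokens: (B, N, patch_dim) patchified latent; t, indicator: (B,)
        cond = self.time_mlp(torch.stack([t, indicator], dim=-1))  # (B, dim)
        h = self.proj_in(x_tokens) + cond.unsqueeze(1)             # broadcast over tokens
        return self.proj_out(self.blocks(h))
```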
Practical Applications and Results:
The paper demonstrates CrossFlow's effectiveness across various cross-modal and intra-modal tasks:
- Text-to-Image Generation:
- Compared to a standard flow matching baseline using a noise source and text cross-attention, CrossFlow (using a vanilla transformer without cross-attention) achieves comparable or slightly better performance (FID, CLIP score) with similar model sizes and training budgets.
- Scalability: CrossFlow shows better scaling characteristics with increasing model size and training steps compared to the standard baseline, suggesting stronger potential for larger models and more training.
- State-of-the-art: Achieves performance competitive with recent SOTA text-to-image models (FID ~9.6 on COCO, GenEval ~0.55), often with significantly less training compute.
- Latent Arithmetic: A unique benefit of encoding the source modality into a regularized semantic space is the ability to perform meaningful arithmetic operations (like interpolation and vector math) directly in the source latent space ($x_0$), which translates into semantically coherent edits in the generated images (see the sketch after this list).
- Image Captioning (Image to Text): CrossFlow is trained to map image latents to text latents. It achieves comparable performance to non-autoregressive SOTA methods on the COCO dataset (Karpathy split).
- Monocular Depth Estimation (Image to Depth): CrossFlow is trained to map image pixels to depth pixels. It achieves performance comparable to SOTA methods on KITTI and NYUv2 datasets, using a unified framework without task-specific designs. It also shows competitive results in zero-shot depth estimation across various real-world datasets.
- Image Super-Resolution (Low-Res to High-Res Image): CrossFlow maps an upsampled low-resolution image directly to a high-resolution image. It outperforms standard flow matching and diffusion (SR3) baselines for super-resolution on ImageNet, demonstrating its effectiveness even for intra-modal tasks involving distribution evolution.
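The latent arithmetic mentioned above reduces to ordinary vector operations on $x_0$ followed by sampling; a sketch reusing `sample_with_cfg` from earlier, with the interpolation scheme as an assumption:

```python
import torch

@torch.no_grad()
def interpolate_and_generate(model, x0_a, x0_b, n=5, guidance=2.0):
    """Linearly interpolate two source latents and decode each point with the
    learned flow (sketch); analogous vector math in the x0 space, e.g.
    x0_edit = x0_base - x0_remove + x0_add, works the same way."""
    images = []
    for alpha in torch.linspace(0.0, 1.0, n):
        x0 = (1 - alpha) * x0_a + alpha * x0_b  # arithmetic in the x0 space
        images.append(sample_with_cfg(model, x0, guidance=guidance))
    return images
```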
Implementation Considerations:
- The Variational Encoder design and training objective (especially the use of contrastive loss) are critical for performance, significantly impacting the quality of the learned source latent space.
- Joint training of the VE and flow matching model is preferable for best results and convergence, although initializing with a pre-trained VE and then jointly finetuning can speed up initial convergence.
- The CFG indicator mechanism is shown to be a practical way to apply CFG in this direct mapping setup, outperforming alternative guidance methods like Autoguidance in their experiments.
- CrossFlow is compatible with different language models for text processing, and performance generally improves with more powerful language models, although it works well even with a smaller text encoder such as CLIP's.
- While CrossFlow requires a VE for the source modality, it simplifies the flow matching model itself by removing the need for cross-attention mechanisms, allowing a vanilla transformer backbone.
In summary, CrossFlow presents a simple, general, and effective framework for cross-modal and intra-modal generation by directly learning the transport between modality distributions using flow matching. By incorporating a Variational Encoder and an indicator-based CFG mechanism, it overcomes key challenges and demonstrates competitive performance across diverse tasks, while offering novel capabilities like latent arithmetic and showing promising scalability.