Align Your Flow: Scalable Generative Distillation
- Align Your Flow is a framework that uses continuous-time flow maps to convert generative models into efficient, few-step samplers.
- It unifies flow-based, diffusion-based, and consistency-based paradigms through novel Eulerian and Lagrangian distillation objectives.
- The approach achieves state-of-the-art performance in high-resolution and text-to-image synthesis while ensuring robust quality in multi-step sampling.
Align Your Flow is a general and scalable framework for distilling high-performing generative diffusion or flow-based models into efficient few-step samplers by learning continuous-time flow maps. The approach connects and generalizes the paradigms of flow-based, diffusion-based, and consistency-based generative modeling, introducing new objectives and training methods that achieve state-of-the-art performance on high-resolution and text-conditioned image synthesis. The central contribution is a distillation strategy that yields models capable of producing high-fidelity samples in very few, or even a single, integration step—while maintaining robustness across a wide range of step counts.
1. Continuous-Time Flow Map Objectives
Align Your Flow (AYF) builds on the notion of flow maps $f_\theta(x_t, t, s)$: parameterized functions that deterministically transport a point $x_t$ from noise level $t$ to an arbitrary target noise level $s$, including the data distribution at $s = 0$. The framework introduces two new continuous-time training objectives for learning these maps:
- Eulerian Map Distillation (AYF-EMD): This objective enforces that the flow map output $f_\theta(x_t, t, s)$ at target time $s$ remains invariant as the input $x_t$ is infinitesimally transported along the probability flow ODE (PF-ODE) towards $s$; equivalently, the total derivative of the map output along the PF-ODE must vanish,
$$\frac{\mathrm{d}}{\mathrm{d}t} f_\theta(x_t, t, s) \;=\; \partial_t f_\theta(x_t, t, s) + \nabla_{x_t} f_\theta(x_t, t, s)\, v_\phi(x_t, t) \;=\; 0,$$
where $v_\phi$ is the teacher's PF-ODE velocity. The continuous-time loss penalizes deviations from this condition, with the derivative target computed under a stop-gradient (see the training strategies below).
This loss generalizes both the flow matching objective (in the limit $s \to t$) and the consistency model loss (when $s = 0$, i.e., mapping directly to data), unifying the two within a continuous-time flow map framework.
- Lagrangian Map Distillation (AYF-LMD): This objective considers the trajectory of the output $f_\theta(x_t, t, s)$ for a fixed input $x_t$ as a function of the target time $s$, enforcing that this trajectory follows the PF-ODE vector field,
$$\frac{\partial}{\partial s} f_\theta(x_t, t, s) \;=\; v_\phi\big(f_\theta(x_t, t, s),\, s\big),$$
where $v_\phi$ is the teacher model's vector field; the loss penalizes deviations from this condition, again with a stop-gradient on the target.
Both objectives are shown to be analytically well-founded and are derived to produce correct multi-step stochastic and deterministic samplers for arbitrary step counts, overcoming central limitations of previous distillation techniques.
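To make the two objectives concrete, here is a minimal PyTorch-style sketch of how the stop-gradient training targets could be computed with Jacobian-vector products. The function names (`f_theta`, `f_ref`, `v_phi`) and the omission of all weighting factors are illustrative assumptions, not the paper's exact implementation.

```python
import torch
from torch.func import jvp  # forward-mode Jacobian-vector product (PyTorch >= 2.0)

def ayf_emd_loss(f_theta, f_ref, v_phi, x_t, t, s):
    """Eulerian sketch: penalize motion of f(x_t, t, s) as (x_t, t) is
    transported along the teacher PF-ODE. Weighting factors omitted."""
    v = v_phi(x_t, t)  # teacher velocity at (x_t, t)
    # Total derivative d/dt f_ref(x_t, t, s): tangent direction (dx/dt, dt/dt) = (v, 1).
    _, df_dt = jvp(lambda x, tt: f_ref(x, tt, s), (x_t, t), (v, torch.ones_like(t)))
    target = df_dt.detach()  # stop-gradient on the JVP target
    pred = f_theta(x_t, t, s)
    return (pred * target).flatten(1).sum(dim=1).mean()

def ayf_lmd_loss(f_theta, f_ref, v_phi, x_t, t, s):
    """Lagrangian sketch: the output trajectory in s should follow the
    teacher vector field evaluated at the map output."""
    out, df_ds = jvp(lambda ss: f_ref(x_t, t, ss), (s,), (torch.ones_like(s),))
    target = (df_ds - v_phi(out, s)).detach()  # stop-gradient target
    pred = f_theta(x_t, t, s)
    return (pred * target).flatten(1).sum(dim=1).mean()
```

The inner-product-with-stop-gradient form mirrors common practice in continuous-time consistency training; in a real implementation the time-dependent weighting, tangent normalization, and warmup discussed in the next section matter greatly for stability.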
2. Training Strategies and Generalization of Prior Models
AYF incorporates several training innovations to ensure stability and accuracy:
- Parameterization and Tangent Normalization: Careful time embeddings and tangent (velocity) normalization stabilize optimization dynamics, taming the rapid variations that arise in continuous-time formulations.
- Tangent Warmup and Regularization: Early training encourages linearity in the flow map, gently introducing higher-order effects once the model can stably support them.
- Flexible Time Scheduling: Sampling time pairs $(t, s)$ across a diverse range ensures the flow map is reliable for both short and long-range transitions (see the sketch after this list).
- Stop-Gradient Targeting: Applying stop-gradient to the targets of the EMD and LMD losses avoids backpropagation through Jacobian-vector products, preventing instability.
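As a purely illustrative sketch of the time-scheduling and warmup points above, the snippet below draws diverse $(t, s)$ pairs and ramps up the tangent (JVP) term early in training; the log-uniform schedule and warmup length are assumptions, not the paper's exact choices.

```python
import torch

def sample_time_pairs(batch_size, t_min=1e-3, t_max=1.0, device="cpu"):
    """Draw a start time t and a target time s with 0 <= s < t, so the map is
    trained on both short hops and long-range jumps (illustrative schedule)."""
    u = torch.rand(batch_size, device=device)
    t = t_min * (t_max / t_min) ** u               # log-uniform start times
    s = torch.rand(batch_size, device=device) * t  # uniform target in [0, t)
    return t, s

def tangent_warmup(step, warmup_steps=10_000):
    """Linearly ramp the weight on the higher-order (tangent/JVP) term from 0
    to 1, keeping the flow map near-linear early in training."""
    return min(1.0, step / warmup_steps)
```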
AYF-EMD subsumes prior consistency and flow-matching objectives as limiting cases, providing a universal training target for flow maps. Discrete-time variants correspond to and improve on Trajectory Consistency Distillation and related algorithms.
3. Performance Benefits and Multi-Step Robustness
AYF-trained flow maps (“Align Your Flow” models) deliver major practical advantages:
- Efficient Few-Step Sampling: Unlike consistency models, which degrade in performance when sampled with more than two steps, AYF flow maps maintain high quality from 1 up to 8 or even 16 steps (or arbitrary schedules). This results from the objectives being explicitly designed for all $(t, s)$ pairs (see the sampling sketch after the table below).
- Competitive or Better Quality with Small Models: On ImageNet 64x64 and 512x512, AYF achieves state-of-the-art FID and Recall among all non-adversarial, few-step samplers. Small AYF models outperform much larger prior models at low computational cost (e.g., roughly 18% of the floating-point operations of the best prior XXL models).
- Sampling Flexibility: Multi-step or deterministic sampling settings do not degrade quality, in contrast to consistency-based approaches, whose theoretical and practical performance gaps are demonstrated analytically and empirically.
| Method/Class | 1–2 Step Quality | Multi-Step Quality | Model Size Required | Sampling Efficiency |
|---|---|---|---|---|
| Consistency Models | High | Poor (degrades) | XXL | High (few steps) |
| Classic Flow Models | Moderate | High (many steps) | Large | Slow (>20 steps) |
| Shortcut Models | Decent | Good (1–8 steps) | Medium | High (few steps) |
| AYF (Ours) | Excellent | Excellent | Small–Medium | Excellent |
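The following is a minimal sketch of how such a flow map would be used as an n-step sampler; `f_theta(x, t, s)` and the uniform time grid are hypothetical placeholders, and a stochastic variant would re-inject noise between hops.

```python
import torch

@torch.no_grad()
def flow_map_sample(f_theta, shape, steps=4, t_max=1.0, device="cuda"):
    """Deterministic n-step sampling with a flow map: hop along a decreasing
    time grid; steps=1 recovers single-step generation."""
    x = torch.randn(shape, device=device)  # start from pure noise at t_max
    times = torch.linspace(t_max, 0.0, steps + 1).tolist()
    for t, s in zip(times[:-1], times[1:]):
        t_b = torch.full((shape[0],), t, device=device)
        s_b = torch.full((shape[0],), s, device=device)
        x = f_theta(x, t_b, s_b)  # jump from noise level t to level s
    return x
```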
4. Autoguidance and Adversarial Finetuning
AYF introduces two additional procedures to further enhance sample quality:
- Autoguidance: A weaker, lower-quality teacher (e.g., an early checkpoint) provides an additional guidance signal. Formally, the distillation teacher's velocity is replaced by the extrapolation
$$\tilde{v}(x_t, t) \;=\; w\, v_\phi(x_t, t) + (1 - w)\, v_{\text{weak}}(x_t, t), \qquad w > 1,$$
where $v_\phi$ is the main teacher, $v_{\text{weak}}$ the weak guiding model, and $w$ the guidance weight. This focuses training on regions where the baseline teacher is weak, sharpening the flow map's outputs without distorting the conditional distribution or requiring explicit classifier-free guidance (a small sketch follows this list).
- Adversarial Finetuning: Brief, post-hoc adversarial finetuning is applied to an AYF model (adding a discriminator loss) to sharpen samples, particularly at one-step sampling. Unlike many adversarially-trained generative models, this does not sacrifice diversity, as confirmed by empirical recall metrics.
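A minimal sketch of the autoguided teacher velocity used during distillation is given below; the function names and the default weight are illustrative assumptions rather than the paper's settings.

```python
def autoguided_velocity(v_phi, v_weak, x_t, t, w=1.5):
    """Autoguidance sketch: extrapolate the strong teacher's velocity away from
    a weaker checkpoint's prediction; w > 1 strengthens the guidance."""
    return w * v_phi(x_t, t) + (1.0 - w) * v_weak(x_t, t)
```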
5. Text-to-Image Application and User Study Results
AYF flow maps extend successfully to text-to-image generation. A lightweight LoRA-based flow map distillation of FLUX.1 [dev], a large text-to-image diffusion model, produces high-quality images in just a few steps.
- User Studies: Human raters strongly prefer AYF LoRA-generated images over leading non-adversarial distillation baselines (LCM, TCD on SDXL), even though AYF uses much smaller models and fewer generation steps.
- Sample Sharpness and Quality: AYF rivals commercial adversarially-trained one-step samplers in both perceived quality and visual details, but requires no adversarial loss during main training.
- Training Speed: LoRA-based flow maps are distilled in only 4 hours on 8 GPUs, demonstrating high practical scalability.
6. Theoretical and Methodological Impact
The AYF framework puts continuous-time training of flow maps on a rigorous footing and highlights theoretical limitations of prior approaches:
- Theory: AYF proves that previous consistency-model-based approaches cannot extend gracefully to arbitrary step counts, explaining and confirming the sharp empirical performance drop in multi-step settings.
- Methodology: By expressing generation as a map between arbitrary noise levels $t$ and target noise levels $s$, the model remains valid under both stochastic and deterministic samplers and combines the strengths of ODE/SDE, flow, and consistency frameworks.
- Unification of Paradigms: This approach synthesizes and generalizes advancements in consistency models, probability flow, and shortcut-based distillation under a mathematically robust, continuous-time map formulation.
Summary Table: Align Your Flow vs. Prior Methods
| Method/Objective | Flexibility (Step Counts) | Sampling Efficiency | Quality Stability | Model Size Efficiency | Guidance/Finetune | Adversarial Optionality |
|---|---|---|---|---|---|---|
| Consistency | 1–2 steps only | High | Degrades >2 steps | XXL typically needed | No | No |
| Flow Matching | All steps, but slow | Low (many steps) | Stable | Large | No | No |
| Shortcut Models | 1–8 steps | High | Reasonable | Medium | No | No |
| AYF (Ours) | Arbitrary | High | Stable | Small–Medium | Autoguidance | Optional, robust |
AYF establishes a versatile and rigorous approach for scalable, robust, and efficient distillation of high-quality generative models. By generalizing and unifying previous methodologies through new continuous-time map objectives, AYF achieves state-of-the-art few-step, high-resolution, and text-conditioned image generation using small and efficient neural networks. The framework thereby enables robust, practical deployment of efficient generative modeling workflows across a wide array of settings.