Improving Progressive Generation with Decomposable Flow Matching (2506.19839v1)

Published 24 Jun 2025 in cs.CV and cs.AI

Abstract: Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted, as they increase the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multi-stage frameworks. On ImageNet-1K 512px, DFM achieves a 35.2% improvement in FDD score over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to the fine-tuning of large models, such as FLUX, DFM shows faster convergence to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.

Summary

  • The paper introduces Decomposable Flow Matching (DFM), achieving a 35.2% FDD improvement on ImageNet-1K 512px and outperforming the strongest prior multi-stage baseline under matched training compute.
  • The paper employs a multiscale decomposition using a Laplacian pyramid and independent flow timesteps to progressively generate high-quality images and videos.
  • The paper demonstrates that DFM reduces training iterations and computational overhead while enabling seamless integration into existing generative pipelines.

Improving Progressive Generation with Decomposable Flow Matching

The paper introduces Decomposable Flow Matching (DFM), a progressive generation framework for high-dimensional visual data, such as images and videos. DFM is designed to address the computational and architectural complexities inherent in existing progressive generative models, particularly those based on diffusion and flow-matching paradigms. The method is evaluated on large-scale benchmarks, including ImageNet-1K and Kinetics-700, and is further validated through fine-tuning experiments on large models like FLUX.

Motivation and Context

Progressive generation decomposes the synthesis of complex visual data into a sequence of simpler sub-tasks, typically proceeding from coarse to fine detail. While this approach is conceptually aligned with the denoising process in diffusion models, most prior work either requires multiple models (as in cascaded diffusion), custom diffusion processes, or intricate stage transition mechanisms. These requirements introduce significant engineering overhead and can limit flexibility, especially when adapting to new data modalities or decompositions.

DFM is proposed as a solution that maintains the benefits of progressive generation while minimizing architectural and procedural complexity. The framework is agnostic to the choice of decomposition and does not require multiple models or specialized samplers.

Methodology

Multiscale Decomposition

DFM operates by decomposing the input (e.g., an image or video) into a user-defined multiscale representation, such as a Laplacian pyramid. Each scale corresponds to a generative stage, with the coarsest scale capturing global structure and finer scales capturing increasing detail. The decomposition is flexible and can be instantiated using various techniques (Laplacian, DWT, DCT, etc.), though the Laplacian pyramid is used in the main experiments for its simplicity.
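
For concreteness, a minimal PyTorch sketch of a Laplacian-pyramid decomposition is shown below. The bilinear resampling and the coarse-to-fine ordering are implementation choices for illustration, not details taken from the paper.

```python
# A minimal Laplacian-pyramid sketch, assuming inputs of shape (B, C, H, W).
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, num_scales: int = 3):
    """Decompose x into one coarse base band plus high-frequency residuals."""
    scales = []
    current = x
    for _ in range(num_scales - 1):
        down = F.interpolate(current, scale_factor=0.5, mode="bilinear",
                             align_corners=False)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        scales.append(current - up)   # high-frequency residual at this scale
        current = down
    scales.append(current)            # coarsest base band
    return scales[::-1]               # coarse-to-fine ordering

def reconstruct(scales):
    """Invert the pyramid: upsample the base and add residuals back in."""
    x = scales[0]
    for detail in scales[1:]:
        x = F.interpolate(x, size=detail.shape[-2:], mode="bilinear",
                          align_corners=False) + detail
    return x
```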

Flow Matching Extension

At its core, DFM extends the Flow Matching framework by introducing independent flow timesteps for each scale. During training, a stage is randomly selected, and noise is injected according to a schedule that simulates progressive generation: the current stage receives a sampled noise level, preceding stages receive low noise, and subsequent stages are maximally noised. The model is trained to predict per-scale velocities, with a masking mechanism to ignore loss contributions from fully noised stages.
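
The following sketch illustrates how such a training step could look. The timestep convention (t = 1 for clean data, t = 0 for pure noise), the near-clean level assigned to earlier stages, and the model interface are assumptions for illustration; the paper's exact schedule and parameterization may differ.

```python
# A hedged sketch of the per-stage training step described above.
import torch

def dfm_training_step(model, scales, low_noise_t: float = 0.95):
    """scales: list of tensors, coarse-to-fine, from the decomposition."""
    num_stages = len(scales)
    stage = torch.randint(num_stages, (1,)).item()   # randomly pick the active stage
    t_active = torch.rand(())                        # sampled noise level for that stage

    noisy, targets, timesteps = [], [], []
    for k, x in enumerate(scales):
        if k < stage:
            t = torch.tensor(low_noise_t)            # earlier stages: nearly clean
        elif k == stage:
            t = t_active                             # active stage: sampled level
        else:
            t = torch.tensor(0.0)                    # later stages: pure noise
        eps = torch.randn_like(x)
        noisy.append(t * x + (1.0 - t) * eps)        # linear interpolation path
        targets.append(x - eps)                      # per-scale velocity target
        timesteps.append(t)

    preds = model(noisy, timesteps)                  # per-scale velocity predictions
    # Masking: fully noised stages (k > stage) contribute no loss.
    loss = sum(((p - v) ** 2).mean()
               for k, (p, v) in enumerate(zip(preds, targets))
               if k <= stage)
    return loss
```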

During inference, a standard ODE sampler is applied sequentially to each stage, following a user-defined schedule. The process starts from the coarsest scale and proceeds to finer scales, with each stage being denoised only after the previous one reaches a threshold. This enables intermediate outputs and efficient previewing.
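
A minimal Euler-integration sketch of this procedure, using the same conventions as the training sketch above, might look as follows. The per-stage step counts are placeholders, and the fully sequential schedule shown is a special case of the threshold-based handover described above.

```python
# A hedged sketch of sequential per-stage sampling; t runs from 0 (noise) to 1 (data).
import torch

@torch.no_grad()
def dfm_sample(model, shapes, steps_per_stage=(32, 16, 8)):
    """shapes: list of per-scale tensor shapes, coarse-to-fine."""
    states = [torch.randn(s) for s in shapes]        # every scale starts as noise
    times = [0.0] * len(states)                      # independent per-scale flow time

    for stage, n_steps in enumerate(steps_per_stage):
        dt = 1.0 / n_steps
        for _ in range(n_steps):
            velocities = model(states, [torch.tensor(t) for t in times])
            # Only the current stage is integrated; earlier stages are already
            # denoised, later stages stay at pure noise until their turn.
            states[stage] = states[stage] + dt * velocities[stage]
            times[stage] += dt
        # states[:stage + 1] now form a coarse preview that can be decoded early.
    return states
```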

Architectural Adaptations

The DiT (Diffusion Transformer) architecture is adapted for DFM by introducing per-scale patchification and time embedding layers. Patch sizes are chosen to ensure consistent token counts across scales, facilitating spatial alignment. The transformer backbone processes the sum of patch embeddings from all scales, and per-scale projection layers output the predicted velocities. This design allows for a single model to handle all stages, in contrast to cascaded approaches.
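
A structural sketch of these adaptations is given below. The wrapper class, the specific patch sizes, and the assumption that the shared backbone consumes a summed feature map are illustrative; only the overall pattern (per-scale patchify and time embeddings feeding one transformer, with per-scale output projections) follows the description above.

```python
# A hedged structural sketch; `backbone` stands in for a DiT-style transformer
# that is assumed to flatten its input to tokens internally.
import torch
import torch.nn as nn

class DFMWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, channels: int, dim: int,
                 patch_sizes=(2, 4, 8)):
        super().__init__()
        # Coarse-to-fine: resolutions double while patch sizes double,
        # so every scale produces the same token grid.
        self.patchify = nn.ModuleList(
            [nn.Conv2d(channels, dim, kernel_size=p, stride=p) for p in patch_sizes])
        self.time_embed = nn.ModuleList(
            [nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
             for _ in patch_sizes])
        self.unpatchify = nn.ModuleList(
            [nn.ConvTranspose2d(dim, channels, kernel_size=p, stride=p)
             for p in patch_sizes])
        self.backbone = backbone                     # a single shared transformer

    def forward(self, scales, timesteps):
        # Sum patch embeddings (plus per-scale time embeddings) from all scales.
        tokens = sum(
            patch(x) + emb(t.view(1, 1))[..., None, None]
            for patch, emb, x, t in zip(self.patchify, self.time_embed,
                                        scales, timesteps))
        h = self.backbone(tokens)
        # Per-scale projections read the shared features out as velocities.
        return [proj(h) for proj in self.unpatchify]
```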

Experimental Results

Quantitative Performance

DFM demonstrates strong improvements over both single-stage and state-of-the-art progressive baselines:

  • On ImageNet-1K 512px, DFM achieves a 35.2% improvement in FDD over the base architecture and 26.4% over the best-performing baseline under equal training compute.
  • On large-scale models (FLUX), DFM fine-tuning yields a 29.7% reduction in FID and a 3.7% increase in CLIP score compared to standard full fine-tuning.

These results are consistent across multiple resolutions and modalities (images and videos), and DFM matches or exceeds baseline performance with fewer training iterations.

Ablation Studies

Extensive ablations are conducted to analyze the impact of:

  • Training timestep distributions: Allocating more training capacity to the coarsest stage improves structural fidelity.
  • Input decomposition strategies: Lower base resolutions for the first stage (e.g., 256px) yield better results, as they focus the model on global structure.
  • Parameter specialization: Introducing per-stage expert parameters (e.g., in MLP layers) can further improve performance, though the main experiments use a shared-parameter model for simplicity.
  • Sampling schedules: Allocating more inference steps to the coarsest stage enhances structural quality, while the number of steps for finer stages can be reduced without significant loss.

Qualitative Analysis

DFM-generated samples exhibit improved structural coherence and detail compared to baselines, particularly in challenging classes and high-resolution settings. The progressive nature of the generation allows for intermediate previews and more interpretable synthesis dynamics.

Implementation Considerations

Integration and Deployment

  • Model Architecture: DFM can be integrated into existing DiT-based pipelines with minimal changes, requiring only the addition of per-scale patchification and time embedding layers.
  • Decomposition: The framework is decomposition-agnostic, but the choice of decomposition affects performance. Laplacian pyramids are recommended for general use.
  • Training: Hyperparameters such as stage sampling probabilities and noise schedules are critical. The paper provides practical guidance and default values that generalize well across datasets.
  • Inference: The sequential sampling schedule is straightforward to implement and can be tuned for speed-quality trade-offs (see the configuration sketch after this list).
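
To make the tunable knobs concrete, a hypothetical configuration is sketched below; the field names and values are placeholders, not the paper's reported defaults.

```python
# Hypothetical DFM configuration; every value here is an illustrative placeholder.
dfm_config = {
    "decomposition": "laplacian",             # decomposition-agnostic; Laplacian recommended
    "num_scales": 3,
    "base_resolution": 256,                   # lower base resolution aids global structure
    "stage_sampling_probs": [0.5, 0.3, 0.2],  # bias training toward the coarsest stage
    "steps_per_stage": [32, 16, 8],           # more inference steps on the coarse stage
}
```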

Computational Requirements

  • Efficiency: DFM leverages lower-dimensional inputs for early stages, reducing compute relative to single-stage models at equivalent resolutions.
  • Scalability: The method scales to high resolutions (1024px+) and large video datasets without requiring multiple models or increased parameter counts.

Limitations

  • Hyperparameter Sensitivity: DFM introduces additional training and sampling hyperparameters, which require tuning for optimal performance.
  • High-Frequency Detail: Overemphasis on structural stages can lead to loss of fine detail in some cases, though this can be mitigated by adjusting sampling steps or training schedules.
  • Autoencoder Dependence: The quality of the multiscale decomposition is influenced by the spectral properties of the underlying autoencoder. Scale-equivariant autoencoders are recommended for best results.

Implications and Future Directions

DFM provides a practical and efficient framework for progressive generation in high-dimensional visual domains. Its simplicity and flexibility make it suitable for integration into existing generative pipelines, and its decomposition-agnostic design opens avenues for exploring alternative representations (e.g., wavelet, DCT, or learned multiscale autoencoders).

Potential future developments include:

  • Exploration of alternative decompositions to further improve spectral alignment and generative quality.
  • Extension to other modalities (e.g., audio, 3D data) where progressive generation is beneficial.
  • Automated hyperparameter tuning to further reduce the manual effort required for deployment.
  • Combination with parameter-efficient fine-tuning (e.g., LoRA) for rapid adaptation to new domains.

DFM's demonstrated improvements in both sample quality and convergence speed, combined with its architectural simplicity, position it as a strong candidate for next-generation progressive generative modeling in both research and production settings.
