- The paper presents a novel Group-wise MultiScale Deformable Convolution (MSDCN) block that decouples scale and direction to stabilize long-range dependency learning.
- It demonstrates that the FlowDCN architecture achieves state-of-the-art image quality and efficiency on the CIFAR-10 and ImageNet benchmarks, with fewer parameters and faster convergence than comparable transformer-based models.
- The model employs an effective Scale Adjustment technique for arbitrary resolutions and optimizes DCN performance with efficient CUDA and Triton-lang implementations.
This paper introduces FlowDCN, a novel convolutional generative model designed for efficient, high-quality image generation at arbitrary resolutions (FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution, 30 Oct 2024). It addresses the limitations of existing transformer-based diffusion models (like DiT), which suffer from quadratic computational complexity ($O(n^2)$ in the number of tokens) and difficulties in extrapolating to resolutions unseen during training, often due to absolute position embeddings.
The core innovation is the Group-wise MultiScale Deformable Convolution (MSDCN) block. This block modifies standard deformable convolution (DCN) in two key ways:
- Decoupling Scale and Direction: Instead of directly predicting the 2D offset vector $\Delta p_k$, the model predicts a learnable relative scale $s(x)$ and a direction $(p_k + \Delta p_k(x))$. The final sampling point $p$ is calculated as $p = p_0 + s(x) \cdot (p_k + \Delta p_k(x))$, where $p_0$ is the center pixel and $p_k$ is a fixed base offset prior. This decoupling aims to stabilize the learning of offsets, especially for long-range dependencies.
- Group-wise Multiscale Priors: Different groups $g$ within the convolution layer are assigned different scale priors $s_0^g$. This allows some groups to focus on local features (small scale priors) while others capture long-range dependencies (large scale priors) within the same layer, increasing flexibility without the heavy downsampling that can discard high-frequency details. The scale priors are initialized logarithmically so the covered scales grow roughly evenly across groups (see the sketch after this list).
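The following minimal PyTorch sketch illustrates both ideas: a fixed base-offset grid $p_k$, log-spaced group scale priors $s_0^g$, and per-pixel predictions of a relative scale and direction offsets combined as $s(x) \cdot (p_k + \Delta p_k(x))$. All names and parameterization choices here (`MSDCNSampling`, `s_max`, the sigmoid scale head) are illustrative assumptions, not the paper's exact implementation:

```python
import math

import torch
import torch.nn as nn


class MSDCNSampling(nn.Module):
    """Sketch of MSDCN sampling-point prediction (assumed names/details)."""

    def __init__(self, channels, num_groups=4, kernel_size=3, s_max=8.0):
        super().__init__()
        self.num_groups = num_groups
        k = kernel_size
        # Fixed base offset prior p_k: a k x k grid centered on the pixel.
        coords = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
        grid = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), dim=-1)
        self.register_buffer("p_k", grid.reshape(-1, 2))        # (K, 2), K = k*k
        # Group-wise scale priors s_0^g, log-spaced from 1 to s_max so groups
        # span local through long-range receptive fields in the same layer.
        s0 = torch.logspace(0.0, math.log10(s_max), num_groups)
        self.register_buffer("s0", s0)                          # (G,)
        # Per pixel and per group: 1 scale logit + 2 offsets per grid point.
        self.pred = nn.Conv2d(channels, num_groups * (1 + 2 * k * k), 1)

    def forward(self, x):
        B, _, H, W = x.shape
        G, K = self.num_groups, self.p_k.shape[0]
        out = self.pred(x).reshape(B, G, 1 + 2 * K, H, W)
        # Learnable relative scale s(x), modulated by the group prior s_0^g
        # (the sigmoid parameterization is an assumption of this sketch).
        s = out[:, :, :1].sigmoid() * self.s0.view(1, G, 1, 1, 1)  # (B,G,1,H,W)
        dp = out[:, :, 1:].reshape(B, G, K, 2, H, W)               # Δp_k(x)
        # Relative sampling points s(x) * (p_k + Δp_k(x)); adding the pixel
        # grid p_0 recovers the absolute points p.
        pk = self.p_k.view(1, 1, K, 2, 1, 1)
        return s.unsqueeze(3) * (pk + dp)                          # (B,G,K,2,H,W)
```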
The FlowDCN architecture stacks these MSDCN blocks. It deliberately omits the U-Net-like long skip connections found in many diffusion models to create a purely DCN-based architecture. It uses patchification for input, AdaLN-Zero for time and class conditioning, SwiGLU activation functions, and RMSNorm for normalization. The model is trained using a linear-based Flow Matching objective, which directly regresses the velocity field between noise and data samples.
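As a concrete reference for the training objective, here is a minimal sketch of the linear flow-matching loss: sample a point on the straight path between noise and data and regress its constant velocity. The interpolation convention (t = 0 at noise, t = 1 at data) and the `model(x_t, t, cond)` signature are assumptions of this sketch:

```python
import torch


def linear_flow_matching_loss(model, x_data, cond):
    """Minimal linear flow-matching sketch: along the straight path
    x_t = (1 - t) * noise + t * x_data, the target velocity is the
    constant x_data - noise; the model regresses it with an MSE loss."""
    noise = torch.randn_like(x_data)
    t = torch.rand(x_data.shape[0], device=x_data.device)
    t_b = t.view(-1, *([1] * (x_data.dim() - 1)))  # broadcast t over C, H, W
    x_t = (1 - t_b) * noise + t_b * x_data         # point on the straight path
    v_target = x_data - noise                      # d x_t / d t
    v_pred = model(x_t, t, cond)                   # assumed model signature
    return ((v_pred - v_target) ** 2).mean()
```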
For arbitrary-resolution sampling, the paper proposes a simple technique called Scale Adjustment. Since the learned scale $s(x)$ is implicitly tied to the training resolution ($H_{train} \times W_{train}$), generating images at a different resolution ($H_{test} \times W_{test}$) can lead to suboptimal receptive fields and hurt global consistency. Scale Adjustment modifies the maximum-scale hyperparameter $S_{\max}$ during inference by multiplying it with the ratio between the training and test resolution along each axis, e.g. $S_{\max,h} = S_{\max} \cdot \frac{H_{train}}{H_{test}}$.
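A minimal helper makes the per-axis adjustment concrete; the ratio direction simply mirrors the formula above, so treat it as an assumption rather than the paper's exact definition:

```python
def adjust_scale(s_max, train_hw, test_hw):
    """Scale Adjustment sketch: rescale the per-axis maximum scale at
    inference by the train/test resolution ratio (direction as in the
    formula above; an assumption of this sketch)."""
    (h_train, w_train), (h_test, w_test) = train_hw, test_hw
    s_max_h = s_max * h_train / h_test  # vertical axis
    s_max_w = s_max * w_train / w_test  # horizontal axis
    return s_max_h, s_max_w


# E.g. generating 224x448 images from a 256x256-trained model yields a
# different adjusted scale for each axis:
# adjust_scale(8.0, (256, 256), (224, 448))
```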
Implementation Details:
- To overcome the latency of standard DCN implementations on the small feature maps common in diffusion backbones, the authors developed efficient CUDA kernels using shared memory (DeformConv(shm)) as well as a Triton-lang implementation (DeformConv(Triton-lang)).
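For reference, the computation these kernels fuse can be written in plain PyTorch with `grid_sample`: bilinear sampling at the predicted points followed by weighted aggregation. This is a single-group reference illustration of the operation, not the authors' kernel code:

```python
import torch
import torch.nn.functional as F


def deform_gather_reference(x, sampling, weights):
    """Plain-PyTorch reference for what an efficient DCN kernel fuses:
    bilinear sampling at per-pixel offsets plus weighted aggregation.
    Purely illustrative; the paper's CUDA/Triton kernels implement this
    far more efficiently (shared memory, fused loops).

    x:        (B, C, H, W) input features
    sampling: (B, K, 2, H, W) sampling points in pixel units, relative
              to each output pixel (e.g. s(x) * (p_k + Δp_k(x)))
    weights:  (B, K, H, W) per-point aggregation weights
    """
    B, C, H, W = x.shape
    K = sampling.shape[1]
    # Absolute sampling locations: pixel grid p_0 plus relative points.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=x.device, dtype=x.dtype),
        torch.arange(W, device=x.device, dtype=x.dtype),
        indexing="ij",
    )
    abs_y = ys + sampling[:, :, 0]                        # (B, K, H, W)
    abs_x = xs + sampling[:, :, 1]
    # Normalize to [-1, 1] in (x, y) order for grid_sample.
    grid = torch.stack(
        (2 * abs_x / (W - 1) - 1, 2 * abs_y / (H - 1) - 1), dim=-1
    )                                                     # (B, K, H, W, 2)
    grid = grid.reshape(B, K * H, W, 2)
    sampled = F.grid_sample(x, grid, align_corners=True)  # (B, C, K*H, W)
    sampled = sampled.reshape(B, C, K, H, W)
    # Weighted aggregation over the K sampling points.
    return (sampled * weights.unsqueeze(1)).sum(dim=2)    # (B, C, H, W)
```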
Experiments and Results:
- CIFAR-10 (32x32): FlowDCN significantly outperforms the SiT baseline (5.47 vs 7.42 FID). Ablations confirm the benefit of the multiscale design, learnable relative scales, and fixed direction priors.
- ImageNet (256x256): FlowDCN models (S, B, L, XL) consistently outperform comparable DiT and SiT models across metrics (FID, sFID, IS) while using ~8% fewer parameters and ~20% fewer FLOPs. They achieve faster convergence, reaching strong results with only ~1/5th the training images. FlowDCN-XL/2 achieves SOTA 4.30 sFID and 2.13 FID (with CFG). Visually, FlowDCN produces clearer images than SiT, especially at very few sampling steps (2-10).
- ImageNet (512x512): Fine-tuning the 256x256 model yields strong results (2.44 FID, 4.53 sFID), competitive with SOTA methods.
- Arbitrary Resolution: Tested on 320x320 and 224x448 using models trained only on 256x256. FlowDCN achieves strong results, particularly on the 320x320 resolution, outperforming DiT and FiT variants without specialized extrapolation training. Scale Adjustment consistently improves visual quality, though metric improvements vary. Appendix results show further gains when using variable aspect ratio training.
Conclusion: FlowDCN presents a promising purely convolutional architecture for image generation. It leverages the proposed MSDCN block and flow matching to achieve state-of-the-art results on ImageNet, particularly in sFID, with improved efficiency (parameters, FLOPs, convergence speed) compared to transformer-based models. It demonstrates strong native capabilities for arbitrary-resolution generation, further enhanced by the simple Scale Adjustment technique.