Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling
Abstract: Diffusion models excel at generating photo-realistic images but incur significant computational costs in both training and sampling. While various techniques address these challenges, a less-explored issue is the design of an efficient and adaptable network backbone for iterative refinement. Current options such as U-Net and the Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed to generate images at variable resolutions or with a smaller network than the one used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing bricks to be selectively skipped to reduce sampling cost and enabling the generation of images at higher resolutions than those seen during training. Each LEGO brick enriches local regions with an MLP and transforms them with a Transformer block, while maintaining a consistent full-resolution representation across all bricks. Experimental results demonstrate that LEGO bricks improve training efficiency, accelerate convergence, and enable variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models. Our code and project page are available at https://jegzheng.github.io/LEGODiffusion.
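To make the architecture concrete, below is a minimal PyTorch sketch of a single LEGO brick as the abstract describes it: a per-patch MLP for local-feature enrichment, followed by a Transformer block for global-content orchestration over the patch tokens, with the full-resolution feature map preserved so bricks can be stacked or skipped. All names, dimensions, and the patchify/unpatchify scheme (`LEGOBrick`, `patch`, `dim`, `heads`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a "LEGO brick" per the abstract: local-feature
# enrichment with a per-patch MLP, then global-content orchestration with
# a Transformer encoder block, preserving the full spatial resolution.
# Module names, sizes, and the patchify scheme are assumptions.
import torch
import torch.nn as nn


class LEGOBrick(nn.Module):
    def __init__(self, channels: int, patch: int = 4, dim: int = 256, heads: int = 4):
        super().__init__()
        self.patch = patch
        patch_dim = channels * patch * patch
        # Local-feature enrichment: an MLP applied to each patch independently.
        self.enrich = nn.Sequential(
            nn.Linear(patch_dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )
        # Global-content orchestration: one Transformer block attending
        # across all patch tokens.
        self.orchestrate = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.project = nn.Linear(dim, patch_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch
        # Patchify: (B, C, H, W) -> (B, N, C*p*p) with N = (H/p)*(W/p).
        tokens = (
            x.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
             .permute(0, 2, 3, 1, 4, 5)
             .reshape(b, (h // p) * (w // p), c * p * p)
        )
        tokens = self.orchestrate(self.enrich(tokens))
        out = self.project(tokens)
        # Unpatchify back to the full-resolution feature map.
        out = (
            out.reshape(b, h // p, w // p, c, p, p)
               .permute(0, 3, 1, 4, 2, 5)
               .reshape(b, c, h, w)
        )
        # Residual connection: the identity path is what lets a brick be
        # skipped at test time without breaking the backbone.
        return x + out


if __name__ == "__main__":
    brick = LEGOBrick(channels=3)
    x = torch.randn(2, 3, 32, 32)
    print(brick(x).shape)  # torch.Size([2, 3, 32, 32])
```

In this sketch the residual connection is what makes a brick skippable: removing it leaves the identity path intact rather than breaking the pipeline. And because the attention operates over a variable number of patch tokens, the same brick can in principle process inputs larger than the training resolution, consistent with the variable-resolution generation the abstract claims.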