
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation (2402.10491v2)

Published 16 Feb 2024 in cs.CV

Abstract: Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just $10k$ steps, with virtually no additional inference time.

Novel Self-Cascade Diffusion Model for Efficient High-Resolution Adaptation

Introduction

Recent developments in diffusion models have driven significant progress in the generation of high-quality images and videos. A critical challenge in the domain is adapting these models to generate content at higher resolutions efficiently: fully fine-tuning large pre-trained models for higher-resolution generation incurs substantial computational overhead and optimization difficulties. This paper introduces a self-cascade diffusion model designed to leverage the knowledge of a well-trained low-resolution model for rapid adaptation to higher-resolution tasks. The approach combines pivot-guided noise re-scheduling with time-aware feature upsampling modules, significantly enhancing the model's adaptability to higher resolutions while requiring minimal fine-tuning.

Related Work

This research emerges against a rich backdrop of work on diffusion models, which are noted for their effectiveness across generative tasks. Strategies for scaling these models to higher-resolution generation typically involve either extensive retraining or progressive training schemes, both of which demand considerable computational resources. Tuning-free methods reduce computational demands but often struggle to maintain fidelity at higher resolutions. Cascaded super-resolution pipelines built on diffusion models present another line of approach, yet these too fall short in balancing parameter efficiency with generative performance.

Methodology

The proposed self-cascade diffusion model combines a pivot-guided noise re-scheduling strategy for tuning-free adaptation with trainable upsampler modules that refine output quality when light fine-tuning is permitted. The method requires only a negligible number of additional trainable parameters (0.002M) and achieves a more than 5x training speed-up compared to full fine-tuning.

  • Pivot-Guided Noise Re-Schedule: at its core, this tuning-free strategy cyclically re-uses the low-resolution model: reliable low-resolution output serves as a semantic pivot that is upsampled and partially re-noised, after which denoising resumes at the higher resolution (see the first sketch below).
  • Time-Aware Feature Upsampler: where tuning is acceptable for additional quality gains, learnable upsampler modules adapt the features extracted by the base model to the higher-resolution domain, guided by a small amount of newly acquired high-resolution training data (see the second sketch below).
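
To make the pivot-guided idea concrete, here is a minimal, schematic sketch assuming a DDPM-style sampler. The `denoise_step` callable and `alphas_cumprod` schedule stand in for the pre-trained low-resolution model's reverse-diffusion update and cumulative noise schedule; they are placeholders, not the authors' actual API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pivot_guided_upscale(denoise_step, x_low, alphas_cumprod,
                         pivot_t=600, scale=2):
    """Tuning-free higher-resolution sampling via pivot-guided re-noising.

    denoise_step(x_t, t) -> x_{t-1} is one reverse-diffusion update of the
    frozen low-resolution model; alphas_cumprod is its cumulative schedule.
    """
    # 1. Upsample the reliable low-resolution sample (the "pivot") to the
    #    target resolution.
    x = F.interpolate(x_low, scale_factor=scale, mode="bilinear",
                      align_corners=False)

    # 2. Re-noise the upsampled pivot back to an intermediate timestep so
    #    the model can re-synthesize high-frequency detail at the new scale.
    a_bar = alphas_cumprod[pivot_t]
    x = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * torch.randn_like(x)

    # 3. Resume the reverse diffusion from pivot_t down to 0 with the
    #    unchanged low-resolution model, now run at the higher resolution.
    for t in range(pivot_t, -1, -1):
        x = denoise_step(x, t)
    return x
```

Because the base model is reused as-is, this path adds no trainable parameters and essentially no inference overhead beyond the extra denoising steps.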
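
For the tuning variant, the sketch below shows one plausible form of a time-aware feature upsampler, written in PyTorch. The specific architecture (nearest-neighbor upsampling, a 3x3 convolution, and FiLM-style timestep modulation) is an illustrative assumption about the general mechanism, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeAwareUpsampler(nn.Module):
    """Illustrative time-aware feature upsampler (hypothetical design).

    Upsamples an intermediate feature map of the frozen base model and
    modulates it with the diffusion timestep embedding, so the amount of
    injected detail can vary across noise levels.
    """

    def __init__(self, channels, time_dim, scale=2):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # FiLM-style scale/shift predicted from the timestep embedding.
        self.time_mlp = nn.Linear(time_dim, 2 * channels)

    def forward(self, feat, t_emb):
        # Lift the feature map onto the higher-resolution grid.
        up = F.interpolate(feat, scale_factor=self.scale, mode="nearest")
        h = self.proj(up)
        # Timestep-conditioned modulation of the refined features.
        gamma, beta = self.time_mlp(t_emb).chunk(2, dim=-1)
        h = h * (1 + gamma[..., None, None]) + beta[..., None, None]
        # Residual connection onto the plainly upsampled features.
        return up + h
```

Attaching one such lightweight module per feature scale, and training only these modules while the base model stays frozen, is consistent with the tiny trainable footprint (0.002M parameters) reported above.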

Experimental Results

The effectiveness of the proposed method is demonstrated through extensive experiments on image and video synthesis tasks, showing superior performance in both tuning-free and fine-tuning settings across various resolution scales. Notably, the model adapts to higher resolutions with only a small fraction of the fine-tuning steps required by conventional methods (about 10k steps), and without a significant increase in inference time.

Implications and Future Work

The introduction of a self-cascade diffusion model represents a significant advancement in the efficient generation of high-resolution images and videos. It opens new avenues for research, particularly in exploring the balance between training efficiency and output fidelity. Future investigations could explore optimizing the architecture of time-aware upsampling modules to further reduce computational demands or extend the model's applicability to other generative tasks beyond image and video synthesis.

Conclusion

This paper sets a new benchmark in the adaptive generation of higher-resolution content from diffusion models. By strategically leveraging the capabilities of well-trained low-resolution models and introducing minimal yet effective fine-tuning mechanisms, it presents a highly efficient and scalable solution to a longstanding challenge in the field of generative models.

Authors (12)
  1. Lanqing Guo
  2. Yingqing He
  3. Haoxin Chen
  4. Menghan Xia
  5. Xiaodong Cun
  6. Yufei Wang
  7. Siyu Huang
  8. Yong Zhang
  9. Xintao Wang
  10. Qifeng Chen
  11. Ying Shan
  12. Bihan Wen