
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation (2402.10491v2)

Published 16 Feb 2024 in cs.CV

Abstract: Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just $10k$ steps, with virtually no additional inference time.

Novel Self-Cascade Diffusion Model for Efficient High-Resolution Adaptation

Introduction

Recent developments in diffusion models have driven significant progress in the generation of high-quality images and videos. A critical challenge in the domain is adapting these models to generate content at higher resolutions efficiently: fully fine-tuning large pre-trained models for higher-resolution generation incurs substantial computational overhead and optimization difficulties. This paper introduces a self-cascade diffusion model designed to leverage the knowledge of a well-trained low-resolution model for rapid adaptation to higher-resolution tasks. The approach combines pivot-guided noise re-scheduling with time-aware feature upsampling modules, significantly enhancing the model's adaptability to higher resolutions while requiring minimal fine-tuning.

Related Work

This research emerges against a rich backdrop of work on diffusion models, which are noted for their effectiveness across generative tasks. Strategies for scaling these models to higher-resolution generation typically involve either extensive retraining or progressive training schemes, both of which demand considerable computational resources. Tuning-free methods reduce computational demands but often struggle to maintain fidelity at higher resolutions. Cascaded super-resolution pipelines built on diffusion models present another line of approach, yet these too fall short in balancing parameter efficiency with generative performance.

Methodology

The proposed self-cascade diffusion model combines a pivot-guided noise re-scheduling strategy for tuning-free adaptation with trainable upsampler modules that refine output quality when light fine-tuning is permitted. The method requires only a negligible number of additional trainable parameters (0.002M) and achieves a more than 5x training speed-up compared to full fine-tuning.

  • Pivot-Guided Noise Re-Schedule: at its core, this tuning-free strategy cyclically re-uses the low-resolution model: reliable low-resolution output serves as a semantic pivot that is upsampled and partially re-noised, after which denoising resumes at the higher resolution (see the first sketch below).
  • Time-Aware Feature Upsampler: where tuning is acceptable for additional quality gains, learnable upsampler modules adapt the features extracted by the base model to the higher-resolution domain, guided by a small amount of newly acquired high-resolution training data (see the second sketch below).
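
To make the pivot-guided idea concrete, here is a minimal, schematic sketch assuming a DDPM-style sampler. The `denoise_step` callable and `alphas_cumprod` schedule stand in for the pre-trained low-resolution model's reverse-diffusion update and cumulative noise schedule; they are placeholders, not the authors' actual API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pivot_guided_upscale(denoise_step, x_low, alphas_cumprod,
                         pivot_t=600, scale=2):
    """Tuning-free higher-resolution sampling via pivot-guided re-noising.

    denoise_step(x_t, t) -> x_{t-1} is one reverse-diffusion update of the
    frozen low-resolution model; alphas_cumprod is its cumulative schedule.
    """
    # 1. Upsample the reliable low-resolution sample (the "pivot") to the
    #    target resolution.
    x = F.interpolate(x_low, scale_factor=scale, mode="bilinear",
                      align_corners=False)

    # 2. Re-noise the upsampled pivot back to an intermediate timestep so
    #    the model can re-synthesize high-frequency detail at the new scale.
    a_bar = alphas_cumprod[pivot_t]
    x = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * torch.randn_like(x)

    # 3. Resume the reverse diffusion from pivot_t down to 0 with the
    #    unchanged low-resolution model, now run at the higher resolution.
    for t in range(pivot_t, -1, -1):
        x = denoise_step(x, t)
    return x
```

Because the base model is reused as-is, this path adds no trainable parameters and essentially no inference overhead beyond the extra denoising steps.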
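
For the tuning variant, the sketch below shows one plausible form of a time-aware feature upsampler, written in PyTorch. The specific architecture (nearest-neighbor upsampling, a 3x3 convolution, and FiLM-style timestep modulation) is an illustrative assumption about the general mechanism, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeAwareUpsampler(nn.Module):
    """Illustrative time-aware feature upsampler (hypothetical design).

    Upsamples an intermediate feature map of the frozen base model and
    modulates it with the diffusion timestep embedding, so the amount of
    injected detail can vary across noise levels.
    """

    def __init__(self, channels, time_dim, scale=2):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # FiLM-style scale/shift predicted from the timestep embedding.
        self.time_mlp = nn.Linear(time_dim, 2 * channels)

    def forward(self, feat, t_emb):
        # Lift the feature map onto the higher-resolution grid.
        up = F.interpolate(feat, scale_factor=self.scale, mode="nearest")
        h = self.proj(up)
        # Timestep-conditioned modulation of the refined features.
        gamma, beta = self.time_mlp(t_emb).chunk(2, dim=-1)
        h = h * (1 + gamma[..., None, None]) + beta[..., None, None]
        # Residual connection onto the plainly upsampled features.
        return up + h
```

Attaching one such lightweight module per feature scale, and training only these modules while the base model stays frozen, is consistent with the tiny trainable footprint (0.002M parameters) reported above.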

Experimental Results

The effectiveness of the proposed method is demonstrated through extensive experiments on image and video synthesis tasks, showing superior performance in both tuning-free and fine-tuning settings across various resolution scales. Notably, the model adapts to higher resolutions with only a small fraction of the fine-tuning steps required by conventional methods (about 10k steps), and without a significant increase in inference time.

Implications and Future Work

The introduction of a self-cascade diffusion model represents a significant advancement in the efficient generation of high-resolution images and videos. It opens new avenues for research, particularly in exploring the balance between training efficiency and output fidelity. Future investigations could explore optimizing the architecture of time-aware upsampling modules to further reduce computational demands or extend the model's applicability to other generative tasks beyond image and video synthesis.

Conclusion

This paper sets a new benchmark in the adaptive generation of higher-resolution content from diffusion models. By strategically leveraging the capabilities of well-trained low-resolution models and introducing minimal yet effective fine-tuning mechanisms, it presents a highly efficient and scalable solution to a longstanding challenge in the field of generative models.

Authors (12)
  1. Lanqing Guo
  2. Yingqing He
  3. Haoxin Chen
  4. Menghan Xia
  5. Xiaodong Cun
  6. Yufei Wang
  7. Siyu Huang
  8. Yong Zhang
  9. Xintao Wang
  10. Qifeng Chen
  11. Ying Shan
  12. Bihan Wen