Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis (2404.13686v3)

Published 21 Apr 2024 in cs.CV

Abstract: Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Current distillation techniques often dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii) ODE Trajectory Reformulation. However, these approaches suffer from severe performance degradation or domain shifts. To address these limitations, we propose Hyper-SD, a novel framework that synergistically amalgamates the advantages of ODE Trajectory Preservation and Reformulation, while maintaining near-lossless performance during step compression. Firstly, we introduce Trajectory Segmented Consistency Distillation to progressively perform consistent distillation within pre-defined time-step segments, which facilitates the preservation of the original ODE trajectory from a higher-order perspective. Secondly, we incorporate human feedback learning to boost the performance of the model in a low-step regime and mitigate the performance loss incurred by the distillation process. Thirdly, we integrate score distillation to further improve the low-step generation capability of the model and offer the first attempt to leverage a unified LoRA to support the inference process at all steps. Extensive experiments and user studies demonstrate that Hyper-SD achieves SOTA performance from 1 to 8 inference steps for both SDXL and SD1.5. For example, Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and +0.51 in Aes Score in the 1-step inference.

Summary

The paper introduces Hyper-SD, a novel framework that enhances diffusion models via trajectory segmented consistency distillation.
It integrates human feedback learning and score distillation to achieve high-quality image generation using only 1 to 8 inference steps.
Experimental results show state-of-the-art performance in aesthetic quality and textual fidelity, validated by metrics and user preference studies.

Enhancing Diffusion Model Step Efficiency through Hyper-SD, a Novel Distillation Framework

Overview of Hyper-SD

Hyper-SD combines trajectory-preservation and trajectory-reformulation techniques for distilling diffusion models (DMs). This unified framework uses trajectory segmented consistency distillation (TSCD), human feedback learning, and score distillation to achieve state-of-the-art (SOTA) performance on Stable Diffusion models such as SDXL and SD1.5 with a reduced number of inference steps, ranging from 1 to 8.

Methodology

Hyper-SD's methodology centers on three primary enhancements to the diffusion model distillation process:

Trajectory Segmented Consistency Distillation (TSCD):
- The proposed TSCD divides the diffusion trajectory into smaller segments, facilitating a more granular and effective distillation process.
- This approach reduces the model's fitting complexity, which mitigates the degradation in generation quality and preserves the fidelity of the original model's trajectory within each segment.
Human Feedback Learning:
- This involves adjusting model outputs based on human aesthetic preferences and feedback from visual perceptual models to improve generation quality.
- The implementation uses aesthetic predictors and instance segmentation models to refine structure and aesthetic appeal, guiding the model toward producing visually pleasing and structurally coherent outputs.
Score Distillation for One-step Generation Enhancement:
- This stage incorporates a Distribution Matching Distillation (DMD) technique aimed specifically at one-step inference. It optimizes the estimation of the score function and thereby improves generation quality when very few inference steps are used.
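
To make the segmentation idea behind TSCD concrete, the following is a minimal sketch, not the authors' implementation. It assumes evenly spaced segments over a hypothetical 1000-step schedule and a progressive segment count of 8, 4, 2, 1; the function names `segment_boundaries` and `segment_start` are illustrative only. For a given time step, it returns the segment boundary that a consistency objective restricted to that segment would map to.

```python
from bisect import bisect_left


def segment_boundaries(total_steps: int, num_segments: int) -> list[int]:
    """Return evenly spaced boundaries s_0 = 0 < ... < s_k = total_steps.

    Even spacing is an assumption made for this sketch.
    """
    if num_segments <= 0 or total_steps % num_segments != 0:
        raise ValueError("total_steps must be a positive multiple of num_segments")
    width = total_steps // num_segments
    return [i * width for i in range(num_segments + 1)]


def segment_start(t: int, boundaries: list[int]) -> int:
    """Return the lower boundary of the segment (s_{i-1}, s_i] that contains t."""
    if not 0 < t <= boundaries[-1]:
        raise ValueError("t must lie in (0, total_steps]")
    # First index whose boundary is >= t; the segment starts one index earlier.
    idx = bisect_left(boundaries, t)
    return boundaries[idx - 1]


# Hypothetical progressive schedule: fewer, longer segments at each stage.
TOTAL_STEPS = 1000
targets = [
    segment_start(600, segment_boundaries(TOTAL_STEPS, k)) for k in (8, 4, 2, 1)
]
```

With a single segment the target boundary is 0, which corresponds to ordinary consistency distillation over the whole trajectory; with more segments, each consistency constraint only has to span a shorter stretch of the trajectory.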

Experimental Results

Extensive experiments and a user study were conducted, showing that Hyper-SD achieves superior performance in both aesthetic quality and textual fidelity across different diffusion model architectures:

Metrics Utilized: CLIP Score and Aesthetic Score, along with the learned preference metrics ImageReward and PickScore, were used to assess performance quantitatively.
Comparison to Baselines: Hyper-SD improved on existing methods such as SDXL-Lightning and various adversarial and trajectory-based distillation techniques. For example, in 1-step inference Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and +0.51 in Aes Score.
User Study Findings: Hyper-SD was preferred significantly more often than other methods, reinforcing the effectiveness of the proposed enhancements.
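
As a rough illustration of how a CLIP-style text-image score is typically computed, the sketch below takes the cosine similarity between an image embedding and a text embedding, scales it by 100, and clips it at zero. This is a common convention and an assumption here; the paper's exact evaluation setup is not described in this summary, and the embeddings would in practice come from a CLIP model rather than being supplied by hand.

```python
import math


def clip_style_score(image_emb: list[float], text_emb: list[float]) -> float:
    """Scaled, non-negative cosine similarity between two embeddings.

    The scale factor of 100 and the clipping at zero follow a common
    convention and are assumptions of this sketch.
    """
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_image = math.sqrt(sum(a * a for a in image_emb))
    norm_text = math.sqrt(sum(b * b for b in text_emb))
    if norm_image == 0.0 or norm_text == 0.0:
        raise ValueError("embeddings must be non-zero")
    cosine = dot / (norm_image * norm_text)
    return max(100.0 * cosine, 0.0)
```

A higher value indicates that the generated image and the prompt lie closer together in the embedding space, which is why this kind of score is used as a proxy for textual fidelity.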

Implications and Future Work

The practical implications of Hyper-SD are profound for real-world applications requiring efficient and high-quality image generation from textual prompts. The ability to operate effectively across a reduced number of inference steps without compromising output quality can lead to more resource-efficient deployments of generative models.

Looking ahead, future developments might focus on:

Maintaining Classifier-Free Guidance (CFG): Ensuring the model can use negative prompts effectively while still functioning under accelerated conditions.
Custom Feedback Optimization: Tailoring feedback learning mechanisms specifically for accelerated models to enhance performance further.
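
For context on the CFG item above, the standard classifier-free guidance rule combines an unconditional and a conditional noise prediction at each sampling step. The sketch below shows only that combination on plain Python lists; the function name `cfg_combine` and the toy values are illustrative, and in common implementations a negative prompt takes the place of the unconditional (empty-prompt) branch.

```python
def cfg_combine(
    eps_uncond: list[float], eps_cond: list[float], guidance_scale: float
) -> list[float]:
    """Standard CFG rule: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same shape")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions standing in for the two denoiser outputs (illustrative values).
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
guided = cfg_combine(uncond, cond, guidance_scale=2.0)
```

A scale of 1 reduces to the conditional prediction alone, so a model that only works at that setting gains nothing from the unconditional or negative-prompt branch.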

Conclusion

Hyper-SD marks a significant advance in the field of generative AI, particularly in the optimization of diffusion models for fewer-step inference with high fidelity and aesthetic quality. It sets a new standard for efficiency in model performance, paving the way for both academic exploration and practical applications in AI-driven image generation.
