TLCM: Training-efficient Latent Consistency Model for Image Generation with 2-8 Steps (2406.05768v6)

Published 9 Jun 2024 in cs.CV and cs.AI

Abstract: Distilling latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face two critical challenges: (1) They hinge on long training using a huge volume of real data. (2) They routinely lead to quality degradation for generation, especially in text-image alignment. This paper proposes a novel training-efficient Latent Consistency Model (TLCM) to overcome these challenges. Our method first accelerates LDMs via data-free multistep latent consistency distillation (MLCD), and then data-free latent consistency distillation is proposed to efficiently guarantee the inter-segment consistency in MLCD. Furthermore, we introduce bags of techniques, e.g., distribution matching, adversarial learning, and preference learning, to enhance TLCM's performance at few-step inference without any real data. TLCM demonstrates a high level of flexibility by enabling adjustment of sampling steps within the range of 2 to 8 while still producing competitive outputs compared to full-step approaches. Notably, TLCM enjoys the data-free merit by employing synthetic data from the teacher for distillation. With just 70 training hours on an A100 GPU, a 3-step TLCM distilled from SDXL achieves an impressive CLIP Score of 33.68 and an Aesthetic Score of 5.97 on the MSCOCO-2017 5K benchmark, surpassing various accelerated models and even outperforming the teacher model in human preference metrics. We also demonstrate the versatility of TLCMs in applications including image style transfer, controllable generation, and Chinese-to-image generation.

Authors (5)
  1. Qingsong Xie (16 papers)
  2. Zhenyi Liao (5 papers)
  3. Zhijie Deng (58 papers)
  4. Chen Chen (753 papers)
  5. Haonan Lu (35 papers)
Citations (2)

Summary

An Expert Review of "MLCM: Multistep Consistency Distillation of Latent Diffusion Model"

The paper "MLCM: Multistep Consistency Distillation of Latent Diffusion Model" introduces a novel approach to distilling large latent diffusion models (LDMs) into more efficient models while maintaining high-quality image synthesis. In essence, the authors propose the Multistep Latent Consistency Model (MLCM) approach, underpinned by Multistep Consistency Distillation (MCD). This method addresses significant challenges faced by existing methods, such as dependency on multiple models for different sampling budgets and quality degradation with limited sampling steps.

Core Contributions

  1. Multistep Latent Consistency Distillation (MLCD): The paper extends multistep consistency distillation to representative LDMs, creating a unified model (MLCM) for various sampling steps by enforcing consistency within segmental partitions of the latent-space ODE trajectory (a sketch of the resulting objective follows this list).
  2. Progressive Training Strategy: To enhance inter-segment consistency, the paper introduces a progressive training strategy, significantly boosting the quality of few-step generations.
  3. Leveraging Teacher Model States: The authors leverage states from the teacher model's sampling trajectory, reducing the need for high-quality training datasets and aligning the training and inference phases.
  4. Human Preference Compatibility: The proposed method seamlessly integrates preference learning strategies to improve visual quality and aesthetic appeal.
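
To make the segmental consistency in item 1 concrete, here is a sketch of a multistep latent consistency objective written in standard consistency-distillation notation; the segment boundaries $s_m$, the distance $d$, the EMA target $f_{\theta^-}$, and the parameterization $f_\theta$ are notational assumptions rather than the paper's exact formulation:

```latex
% Partition the latent ODE trajectory over [0, T] into M segments with
% boundaries 0 = s_0 < s_1 < \dots < s_M = T. Within segment m, the student
% f_\theta maps any latent x_t to the segment endpoint x_{s_m}; consistency
% is enforced against an EMA target f_{\theta^-} evaluated one teacher ODE
% step earlier (t -> t'):
\mathcal{L}_{\mathrm{MLCD}}(\theta)
  = \mathbb{E}_{m,\,t,\,x_t}\!\left[
      d\!\left( f_\theta(x_t, t),\; f_{\theta^-}(\hat{x}_{t'}, t') \right)
    \right],
  \qquad s_m \le t' < t \le s_{m+1},
% where \hat{x}_{t'} is obtained from x_t via one step of the teacher's ODE
% solver, and d(\cdot,\cdot) is a distance such as Huber or squared L2.
```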

Empirical Evaluation

The authors conducted a comprehensive evaluation using the MSCOCO-2017 5K benchmark, showcasing substantial performance improvements over existing methods. The key findings include the following metrics for MLCM distilled from SDXL:

  • CLIP Score: 33.30
  • Aesthetic Score: 6.19
  • Image Reward: 1.20

These gains are notable given that the 4-step MLCM is compared against strong baselines such as 4-step LCM, 8-step SDXL-Lightning, and 8-step Hyper-SD.

Strong Numerical Results and Bold Claims

The paper makes several quantitative claims that challenge established baselines:

  • The streamlined MLCM generates high-quality images within 2-4 steps, surpassing methods that require 8 steps (a few-step sampling sketch follows this list).
  • Progressive training of MLCD significantly reduces inter-segment consistency errors, further enhancing generation quality.
  • Employing a better teacher model (e.g., PVXL) for trajectory estimation markedly improves MLCM's performance.
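
As a rough illustration of how a single consistency-distilled model can serve different step budgets, here is an LCM-style few-step sampling sketch. Every name here (`model`, `noise_scheduler.get_timesteps`, `add_noise`, the latent shape) is a hypothetical stand-in, not the authors' API:

```python
import torch

@torch.no_grad()
def few_step_sample(model, noise_scheduler, prompt_emb, num_steps=4):
    """LCM-style few-step sampling sketch (illustrative, not the authors' code).

    At each step the consistency model predicts a clean latent directly,
    and the scheduler re-noises it to the next (lower) timestep, so the
    same model supports different step budgets (e.g., 2-8).
    """
    # Hypothetical helper returning a decreasing timestep schedule,
    # e.g., [999, 749, 499, 249] for num_steps=4.
    timesteps = noise_scheduler.get_timesteps(num_steps)
    x = torch.randn(1, 4, 128, 128)  # pure-noise latent (SDXL-like shape)
    for i, t in enumerate(timesteps):
        x0_pred = model(x, t, prompt_emb)  # direct clean-latent prediction
        if i + 1 < len(timesteps):
            # Re-noise the prediction to the next timestep on the trajectory.
            x = noise_scheduler.add_noise(
                x0_pred, torch.randn_like(x0_pred), timesteps[i + 1]
            )
        else:
            x = x0_pred  # final step: keep the clean prediction
    return x  # decode with the VAE to obtain an image
```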

Methodological Advancements

The research presented combines theoretical and practical innovations, reinforcing its findings through methodical experimentation.

  1. Segmentation of ODE Trajectory: By dividing the latent-space ODE trajectory into multiple segments, MLCM maintains high fidelity over fewer steps, mitigating error accumulation.
  2. Transition from Teacher to Student: Training the student on intermediate states drawn from the teacher's denoising trajectory streamlines learning and aligns training with inference (a rough training-loop sketch follows this list).
  3. Human Preference Integration: The inclusion of reward consistency and feedback learning ensures that MLCM outputs are not only technically proficient but also aligned with human aesthetic preferences.
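
As a rough sketch of how the segmented, teacher-driven training in items 1-2 might be wired together (all names and signatures here, including teacher_step, student, and ema_student, are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def mlcd_loss(student, ema_student, teacher_step, x_t, t):
    """Hypothetical sketch of a multistep latent consistency loss.

    Enforces that the student maps latents at nearby times within the
    same trajectory segment to the same segment-endpoint prediction.
    """
    # One ODE step of the frozen teacher toward lower noise (t -> t_prev),
    # staying inside the current trajectory segment.
    x_prev, t_prev = teacher_step(x_t, t)

    # Student prediction at time t, and EMA-target prediction at t_prev;
    # both should agree on the segment-endpoint latent.
    pred = student(x_t, t)
    with torch.no_grad():
        target = ema_student(x_prev, t_prev)

    # Huber distance is a common choice in consistency distillation.
    return F.huber_loss(pred, target)
```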

Practical Implications and Future Developments

From a practical standpoint, MLCM holds the potential to enhance various applications, including controllable generation, image stylization, and Chinese-to-image generation. Given this versatility, future research could explore extending MLCM to video generation and other high-dimensional applications. Moreover, improving one-step generation while preserving quality remains a promising avenue for subsequent investigation.

Conclusion

The paper provides a robust framework for accelerating LDMs via multistep consistency distillation, successfully addressing existing limitations. The empirical results, combined with methodological rigor, position MLCM as a notable contribution to the diffusion model landscape. Future work will likely benefit from the principles established in this paper, advancing both theoretical understanding and practical implementations in AI-driven image synthesis.