
Development and Enhancement of Text-to-Image Diffusion Models (2503.05149v1)

Published 7 Mar 2025 in cs.CV and cs.AI

Abstract: This research focuses on the development and enhancement of text-to-image denoising diffusion models, addressing key challenges such as limited sample diversity and training instability. By incorporating Classifier-Free Guidance (CFG) and Exponential Moving Average (EMA) techniques, this study significantly improves image quality, diversity, and stability. Utilizing Hugging Face's state-of-the-art text-to-image generation model, the proposed enhancements establish new benchmarks in generative AI. This work explores the underlying principles of diffusion models, implements advanced strategies to overcome existing limitations, and presents a comprehensive evaluation of the improvements achieved. Results demonstrate substantial progress in generating stable, diverse, and high-quality images from textual descriptions, advancing the field of generative artificial intelligence and providing new foundations for future applications. Keywords: Text-to-image, Diffusion model, Classifier-free guidance, Exponential moving average, Image generation.


Summary

  • The paper explores enhancements to text-to-image diffusion models aimed at improving sample diversity and training stability.
  • Key methodologies include integrating Classifier-Free Guidance (CFG) to enhance image quality via conditional/unconditional text embeddings and employing Exponential Moving Average (EMA) for stable training.
  • Quantitative results show the enhanced model achieved a Fréchet Inception Distance (FID) of 1088.94, a significant improvement over the baseline's 1332.33, indicating better generated image quality.

This paper explores enhancements to text-to-image diffusion models, specifically addressing limitations in sample diversity and training instability.

  • The paper integrates Classifier-Free Guidance (CFG) to improve image quality by conditioning the model on both conditional and unconditional text embeddings, using a configurable guidance scale $w$ to adjust the noise prediction $\hat{\epsilon}$ as $\hat{\epsilon} = \epsilon_\theta(x_t, t, c) + w\left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\right)$, where $\epsilon_\theta$ is the noise prediction model, $x_t$ is the noisy image at time step $t$, $c$ is the conditional text embedding, and $\emptyset$ is the unconditional (empty) text embedding.
  • Exponential Moving Average (EMA) is employed to stabilize training: after each training step the EMA parameters $\theta_{\mathrm{EMA}}$ are updated as $\theta_{\mathrm{EMA}} \leftarrow \alpha\,\theta_{\mathrm{EMA}} + (1 - \alpha)\,\theta$, where $\alpha$ is the decay factor and $\theta$ denotes the current model parameters.
  • Quantitative results using Fréchet Inception Distance (FID) show the enhanced model achieving a score of 1088.94, a marked improvement over the baseline's 1332.33 (lower is better), indicating enhanced image quality and realism. The authors attribute the unusually high FID scores of both models to the mismatch between the real reference images and the creative text prompts used for generation.
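The two update rules above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `cfg_noise` and `ema_update` are chosen here for clarity, and the arrays stand in for the model's noise predictions and parameters.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-Free Guidance as written in the paper:
    eps_hat = eps_cond + w * (eps_cond - eps_uncond).
    eps_cond / eps_uncond are the noise predictions for the
    conditional and unconditional (empty) text embeddings."""
    return eps_cond + w * (eps_cond - eps_uncond)

def ema_update(theta_ema, theta, alpha=0.999):
    """One EMA step on the model parameters:
    theta_ema <- alpha * theta_ema + (1 - alpha) * theta."""
    return alpha * theta_ema + (1 - alpha) * theta

# Toy usage with placeholder values.
eps_c = np.array([1.0, 0.0])   # conditional noise prediction
eps_u = np.array([0.5, 0.0])   # unconditional noise prediction
guided = cfg_noise(eps_c, eps_u, w=2.0)   # -> [2.0, 0.0]

shadow = ema_update(np.zeros(2), np.ones(2), alpha=0.9)  # -> [0.1, 0.1]
```

Note that some formulations of CFG instead start from the unconditional prediction, $\epsilon_\theta(x_t, t, \emptyset) + w(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$; the sketch follows the variant stated in this paper, where $w = 0$ recovers the purely conditional prediction.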

Authors (1)