
Development and Enhancement of Text-to-Image Diffusion Models (2503.05149v1)

Published 7 Mar 2025 in cs.CV and cs.AI

Abstract: This research focuses on the development and enhancement of text-to-image denoising diffusion models, addressing key challenges such as limited sample diversity and training instability. By incorporating Classifier-Free Guidance (CFG) and Exponential Moving Average (EMA) techniques, this study significantly improves image quality, diversity, and stability. Utilizing Hugging Face's state-of-the-art text-to-image generation model, the proposed enhancements establish new benchmarks in generative AI. This work explores the underlying principles of diffusion models, implements advanced strategies to overcome existing limitations, and presents a comprehensive evaluation of the improvements achieved. Results demonstrate substantial progress in generating stable, diverse, and high-quality images from textual descriptions, advancing the field of generative artificial intelligence and providing new foundations for future applications. Keywords: Text-to-image, Diffusion model, Classifier-free guidance, Exponential moving average, Image generation.


Summary

  • The paper explores enhancements to text-to-image diffusion models aimed at improving sample diversity and training stability.
  • Key methodologies include integrating Classifier-Free Guidance (CFG) to enhance image quality via conditional/unconditional text embeddings and employing Exponential Moving Average (EMA) for stable training.
  • Quantitative results show the enhanced model achieved a Fréchet Inception Distance (FID) of 1088.94, a significant improvement over the baseline's 1332.33, indicating better generated image quality.

This paper explores enhancements to text-to-image diffusion models, specifically addressing limitations in sample diversity and training instability.

  • The paper integrates Classifier-Free Guidance (CFG) to improve image quality by conditioning the model on both conditional and unconditional text embeddings, using a configurable guidance scale $w$ to adjust the noise prediction $\hat{\epsilon}$ as $\hat{\epsilon} = \epsilon_\theta(x_t, t, c) + w\left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\right)$, where $\epsilon_\theta$ is the noise prediction model, $x_t$ is the noisy image at time step $t$, $c$ is the conditional text embedding, and $\emptyset$ is the unconditional (empty) text embedding.
  • Exponential Moving Average (EMA) is employed to stabilize training: after each training step the EMA parameters $\theta_{\mathrm{EMA}}$ are updated as $\theta_{\mathrm{EMA}} \leftarrow \alpha\,\theta_{\mathrm{EMA}} + (1 - \alpha)\,\theta$, where $\alpha$ is the decay factor and $\theta$ denotes the current model parameters.
  • Quantitative results using Fréchet Inception Distance (FID) show the enhanced model achieving a score of 1088.94, a marked improvement over the baseline's 1332.33 (lower is better), indicating enhanced image quality and realism. The authors attribute the unusually high FID scores of both models to the mismatch between the real reference images and the creative text prompts used for generation.
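The two update rules above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `cfg_noise` and `ema_update` are chosen here for clarity, and the arrays stand in for the model's noise predictions and parameters.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-Free Guidance as written in the paper:
    eps_hat = eps_cond + w * (eps_cond - eps_uncond).
    eps_cond / eps_uncond are the noise predictions for the
    conditional and unconditional (empty) text embeddings."""
    return eps_cond + w * (eps_cond - eps_uncond)

def ema_update(theta_ema, theta, alpha=0.999):
    """One EMA step on the model parameters:
    theta_ema <- alpha * theta_ema + (1 - alpha) * theta."""
    return alpha * theta_ema + (1 - alpha) * theta

# Toy usage with placeholder values.
eps_c = np.array([1.0, 0.0])   # conditional noise prediction
eps_u = np.array([0.5, 0.0])   # unconditional noise prediction
guided = cfg_noise(eps_c, eps_u, w=2.0)   # -> [2.0, 0.0]

shadow = ema_update(np.zeros(2), np.ones(2), alpha=0.9)  # -> [0.1, 0.1]
```

Note that some formulations of CFG instead start from the unconditional prediction, $\epsilon_\theta(x_t, t, \emptyset) + w(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$; the sketch follows the variant stated in this paper, where $w = 0$ recovers the purely conditional prediction.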

Authors (1)