StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis (2301.09515v1)

Published 23 Jan 2023 in cs.LG and cs.CV

Abstract: Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.

Overview of StyleGAN-T: A Model for Fast Large-Scale Text-to-Image Synthesis

The research paper presents StyleGAN-T, a novel generative adversarial network (GAN) model designed specifically for large-scale text-to-image synthesis. This work identifies key components that enable GANs to become competitive with diffusion and autoregressive models in text-to-image tasks, particularly emphasizing speed and sample quality.
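
To make the speed contrast concrete, the sketch below compares one-pass GAN sampling with iterative diffusion sampling. It is a minimal PyTorch illustration with assumed `gan_generator` and `denoiser` interfaces, not the paper's actual modules:

```python
import torch

@torch.no_grad()
def sample_gan(gan_generator, text_emb, z_dim=512, device="cuda"):
    """One image per forward pass: the latency advantage of GANs."""
    z = torch.randn(1, z_dim, device=device)
    return gan_generator(z, text_emb)  # single network evaluation

@torch.no_grad()
def sample_diffusion(denoiser, text_emb, steps=50, shape=(1, 3, 64, 64), device="cuda"):
    """Iterative sampling: one network evaluation per denoising step."""
    x = torch.randn(shape, device=device)  # start from pure noise
    for t in reversed(range(steps)):
        x = denoiser(x, t, text_emb)       # assumed denoiser signature
    return x
```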

Key Contributions

  1. Architecture Design: StyleGAN-T builds on StyleGAN-XL, with significant modifications to meet the demands of large-scale text-to-image synthesis. The architecture retains the GAN property of generating an image in a single forward pass, differentiating it from iterative approaches such as diffusion models (DMs) and autoregressive models (ARMs).
  2. Generator Enhancements: The generator drops translational equivariance, which was deemed unnecessary for this task. A combination of residual convolutions, stronger text-conditioning mechanisms, and a second-order polynomial style transform substantially increases its expressive power, allowing the model to handle highly diverse data (see the sketch after this list).
  3. Discriminator Redesign: The discriminator is lightweight yet powerful, built on a self-supervised ViT-S feature network trained with the DINO objective. Multiple discriminator heads operating on these features capture diverse aspects of the images, and differentiable data augmentation further strengthens training.
  4. Guidance and Variation Management: A strategy for managing the trade-off between image variation and text alignment is central to the work. Modified CLIP guidance applied in a secondary training phase improves text alignment without sacrificing sample diversity, and an explicit truncation trick allows further tuning of text alignment at inference time (a CLIP-guidance sketch also follows this list).
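
Item 2's second-order polynomial style transform can be pictured as extending the usual affine map from the latent code to a per-layer style vector with a quadratic term. The module below is an illustrative sketch; its name, shapes, and exact parameterization are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PolynomialStyleTransform(nn.Module):
    """Illustrative second-order style transform (hypothetical).

    StyleGAN's per-layer styles are usually an affine function of the
    latent code w; adding a w-dependent second-order term increases the
    expressive power of the mapping.
    """

    def __init__(self, w_dim: int, style_dim: int):
        super().__init__()
        self.linear = nn.Linear(w_dim, style_dim)  # first-order (affine) term
        self.gate = nn.Linear(w_dim, style_dim)    # produces the quadratic interaction

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        s1 = self.linear(w)
        return s1 + self.gate(w) * s1              # affine term plus second-order term
```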

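For item 4, CLIP guidance adds a loss term that rewards similarity between generated images and their prompts in CLIP's joint embedding space. A minimal sketch using OpenAI's clip package follows; the resizing, the omission of CLIP's input normalization, and the plain negative-cosine form are simplifying assumptions (the paper uses a modified variant):

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def clip_guidance_loss(images: torch.Tensor, prompts: list[str]) -> torch.Tensor:
    """Negative cosine similarity between generated images and their prompts."""
    # CLIP expects 224x224 inputs; resize generated samples accordingly.
    images = F.interpolate(images, size=(224, 224), mode="bilinear", align_corners=False)
    tokens = clip.tokenize(prompts).to(images.device)
    img_feat = F.normalize(clip_model.encode_image(images), dim=-1)
    txt_feat = F.normalize(clip_model.encode_text(tokens), dim=-1)
    return -(img_feat * txt_feat).sum(dim=-1).mean()
```

Weighted against the adversarial loss, a term like this pulls samples toward the prompt; too large a weight sacrifices diversity, which is precisely the variation-vs-alignment trade-off the secondary training phase manages.
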
Results

The paper reports strong quantitative results for StyleGAN-T:

  • At 64×64 resolution, StyleGAN-T achieves a zero-shot Fréchet Inception Distance (FID) of 7.30, outperforming earlier GANs and comparing favorably with diffusion models while maintaining rapid inference (about 0.1 seconds per sample).
  • At 256×256 resolution, StyleGAN-T reaches an FID of 13.90, a notable improvement over previous GAN attempts but still behind the leading diffusion models (a sketch of how such a score can be computed follows this list).
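
The zero-shot FID scores above compare Inception-feature statistics of generated and real images. The sketch below shows how such a score can be computed with torchmetrics; the tiny random batches stand in for real evaluation data and model samples, which in practice number in the tens of thousands:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# normalize=True: inputs are float tensors in [0, 1] of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048, normalize=True)

# Placeholder data; in practice, use reference images and generator samples.
real_images = torch.rand(64, 3, 64, 64)
fake_images = torch.rand(64, 3, 64, 64)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower is better; StyleGAN-T reports 7.30 at 64x64
```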

Implications and Future Directions

The research charts a path for GANs to regain competitiveness in large-scale text-to-image synthesis. Fast inference and smooth latent-space interpolation make GANs viable candidates for real-time applications, although current limitations at high resolutions call for further refinement; a latent interpolation sketch follows below.
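
The interpolation property can be demonstrated by walking a straight line between two latent codes and generating a frame at each step; the smoothness of the resulting sequence reflects the smoothness of the GAN's latent space. A minimal sketch with an assumed one-pass `generator(z, text_emb)` interface:

```python
import torch

@torch.no_grad()
def interpolate_latents(generator, text_emb, steps=8, z_dim=512, device="cuda"):
    """Images along a linear path between two random latent codes."""
    z0 = torch.randn(1, z_dim, device=device)
    z1 = torch.randn(1, z_dim, device=device)
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z0 + alpha * z1  # lerp in latent space
        frames.append(generator(z, text_emb))
    return frames
```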

Key future research avenues include improving GANs' super-resolution stage, addressing attribute binding and text coherence via more powerful language models, and exploring personalized model extensions similar to those developed for diffusion models. Such developments would broaden GANs' applicability in image generation and keep them relevant alongside iterative approaches.

The findings indicate that, with strategic modifications and optimizations, GANs can approach state-of-the-art sample quality while offering far faster inference, potentially providing more efficient solutions for large-scale generative tasks.

Authors (5)
  1. Axel Sauer (14 papers)
  2. Tero Karras (26 papers)
  3. Samuli Laine (21 papers)
  4. Andreas Geiger (136 papers)
  5. Timo Aila (23 papers)
Citations (178)