Overview of TransGAN: Transforming GANs with Pure Transformers
The paper "TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up" presents an innovative exploration of using transformer architectures in place of convolutional neural networks (CNNs) for generative adversarial networks (GANs). The research seeks to establish a framework for creating GANs that do not rely on convolutions, focusing entirely on transformer-based architectures. Named TransGAN, this approach delineates substantial adjustments in architecture, alongside novel training methodologies to mitigate inherent challenges.
Key Contributions
Transformer-based Architecture: The paper introduces a memory-efficient, transformer-based generator that increases feature-map resolution in stages, paired with a multi-scale discriminator that balances capturing semantic context against preserving texture detail. To keep memory usage tractable at high resolutions, the generator replaces standard global self-attention with grid self-attention, which restricts attention to non-overlapping local regions of the feature map.
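The mechanics of grid self-attention are easiest to see in code. The sketch below is a minimal PyTorch illustration of the idea, partitioning the token map into fixed-size cells and attending within each one; the class name, default grid size, and head count are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GridSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatial cells.

    Attending within fixed-size g x g cells keeps the per-cell cost
    constant, so memory grows linearly with resolution rather than
    quadratically with the full H*W sequence length.
    """
    def __init__(self, dim, grid_size=16, num_heads=4):
        super().__init__()
        self.grid_size = grid_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) token sequence over an h-by-w feature map.
        b, n, c = x.shape
        g = self.grid_size
        # Partition the map into (h/g * w/g) non-overlapping g x g cells.
        x = x.reshape(b, h // g, g, w // g, g, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, g * g, c)            # each cell becomes one sequence
        x, _ = self.attn(x, x, x)              # attention stays local to a cell
        # Reassemble the cells into the original token order.
        x = x.reshape(b, h // g, w // g, g, g, c).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, n, c)

# Example: attend within 16x16 cells of a 32x32 token map.
attn = GridSelfAttention(dim=64, grid_size=16, num_heads=4)
y = attn(torch.randn(2, 32 * 32, 64), h=32, w=32)   # shape is preserved
```

Per the paper, only the high-resolution generator stages use this restricted attention; coarser stages keep full self-attention, since their token sequences are short enough to attend over globally.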
Training Techniques: The paper also innovates on the GAN training front, combining data augmentation, a modified normalization scheme, and relative position encoding in the attention layers. Together these address the training instability GANs are notorious for, which becomes more severe once transformers replace convolutions.
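Of these, relative position encoding is the most transformer-specific. The sketch below shows one common way to realize it, a Swin-style learnable bias table indexed by relative offsets and added to the attention logits; this is a plausible stand-in for illustration, not necessarily TransGAN's exact formulation.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable attention bias indexed by relative spatial offset.

    For a g x g grid of tokens, every (query, key) pair at the same
    displacement (dy, dx) shares one learned bias per head, making
    attention translation-aware without absolute positions.
    """
    def __init__(self, grid, num_heads):
        super().__init__()
        # One entry per possible (dy, dx) offset, per head.
        self.table = nn.Parameter(torch.zeros((2 * grid - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid), torch.arange(grid), indexing="ij"))
        flat = coords.flatten(1)                            # (2, g*g)
        rel = flat[:, :, None] - flat[:, None, :]           # (2, N, N) offsets
        rel = rel.permute(1, 2, 0) + (grid - 1)             # shift into [0, 2g-2]
        index = rel[..., 0] * (2 * grid - 1) + rel[..., 1]  # flatten (dy, dx)
        self.register_buffer("index", index)                # (N, N)

    def forward(self):
        # Returns a (num_heads, N, N) bias for the attention logits.
        n = self.index.shape[0]
        return self.table[self.index.reshape(-1)].view(n, n, -1).permute(2, 0, 1)
```

The returned tensor is broadcast-added to the (B, heads, N, N) attention logits before the softmax in each attention layer.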
Numerical Results
TransGAN's effectiveness is borne out on standard benchmarks. It reaches an Inception Score (IS) of 10.43 and a Fréchet Inception Distance (FID) of 18.28 on STL-10, surpassing leading CNN-based GANs. It likewise records an IS of 9.02 and an FID of 9.26 on CIFAR-10, and an FID of 5.28 on CelebA at 128x128 resolution. These numbers indicate that TransGAN generates diverse, high-fidelity images.
Architectural Innovations
- Generator Design:
- Stacks transformer blocks that progressively increase feature-map resolution stage by stage (an upsampling sketch follows this list).
- Applies grid self-attention at high-resolution stages, so longer token sequences can be handled without exhausting memory.
- Discriminator Structure:
- Introduces a multi-scale approach that feeds patches of different sizes into different stages, so the discriminator captures both global semantics and fine local texture.
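TransGAN's generator grows its token map between stages with upsampling modules, one variant of which is pixel shuffle: channel depth is rearranged into spatial positions, so the sequence lengthens without adding parameters. A minimal sketch (the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def upsample_tokens(x, h, w):
    """Double a token map's resolution via pixel shuffle.

    (B, H*W, C) -> (B, 2H*2W, C/4): channel depth is traded for spatial
    size, so the token sequence grows 4x while the total amount of
    feature data stays the same.
    """
    b, n, c = x.shape
    x = x.transpose(1, 2).reshape(b, c, h, w)   # tokens -> channel-first map
    x = F.pixel_shuffle(x, 2)                   # (B, C/4, 2H, 2W)
    return x.flatten(2).transpose(1, 2)         # back to a token sequence

# Example: an 8x8 map with 256 channels becomes a 16x16 map with 64 channels.
tokens = torch.randn(2, 8 * 8, 256)
up = upsample_tokens(tokens, 8, 8)              # shape: (2, 256, 64)
```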
Implications and Future Directions
The theoretical significance of this work lies in challenging the widely held assumption that CNNs are indispensable for image generation in GANs. By employing transformers, the paper opens avenues for modeling long-range dependencies and global contextual relationships in image data. Practically, the architecture holds promise for scaling generative models to higher resolutions while retaining quality and computational feasibility.
Future work may extend this foundation to ultra-high-resolution synthesis or to much larger datasets, leveraging transformers' capacity to model long-range correlations. By establishing a clear recipe for building GANs from transformers, this research sets a precedent for bringing modern transformer advantages into generative modeling, potentially leading to more adaptable and robust visual-synthesis systems.
In conclusion, this paper makes a substantial contribution to GAN architecture research, showing that a purely transformer-based framework can achieve strong results. The insights from TransGAN not only deepen our understanding of generative models but also invite exploration of their scalable application across domains.