Analysis of ViTGAN: Training GANs with Vision Transformers
The paper presents ViTGAN, an integration of Vision Transformers (ViTs) into Generative Adversarial Networks (GANs). The work is motivated by the competitive performance ViTs have shown on image recognition tasks and asks whether the same architecture can also serve image generation.
Contributions and Methodology
The principal contribution is the ViTGAN model together with the techniques that make it trainable. The authors identify a central obstacle: GAN training becomes unstable when combined with the self-attention mechanism inherent in ViTs, and the regularization techniques that work for CNN-based GANs do not suffice to stabilize the transformer-based approach. To address this, the authors introduce regularization methods designed specifically for ViTGAN.
Key techniques proposed include:
- Revisiting Lipschitz Regularity: To keep the discriminator Lipschitz, dot-product self-attention is replaced with an L2-distance formulation with tied query/key projections, and spectral normalization is adapted so that the constraint does not become overly restrictive for transformer layers (a minimal sketch follows this list).
- Self-Modulated Layer Normalization: In the generator, the gain and bias of each layer normalization are predicted from the latent code, injecting the latent at every block and helping to stabilize the adversarial training dynamics (see the generator sketch after the next paragraph).
- Implicit Neural Representation for Patch Generation: Each output patch is rendered by a small coordinate-based MLP that takes Fourier-feature encodings of pixel positions, conditioned on the patch embedding, yielding a smoother mapping from latent embeddings to output pixels (also covered in the generator sketch after the next paragraph).
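To make the Lipschitz idea concrete, here is a minimal PyTorch-style sketch of L2-distance self-attention with a tied query/key projection, in the spirit of the ViTGAN discriminator. The class name, default head count, and tensor layout are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2SelfAttention(nn.Module):
    """Sketch of L2-distance multi-head self-attention (hypothetical names)."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Query and key share one projection (tied weights), as suggested for
        # Lipschitz self-attention; value and output keep separate projections.
        self.qk_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                                       # x: (B, N, dim)
        B, N, _ = x.shape
        h, d = self.num_heads, self.head_dim
        q = self.qk_proj(x).view(B, N, h, d).transpose(1, 2)    # (B, h, N, d)
        k = q                                                    # tied projection
        v = self.v_proj(x).view(B, N, h, d).transpose(1, 2)
        # Pairwise squared Euclidean distances replace dot products.
        q_sq = (q ** 2).sum(-1, keepdim=True)                    # (B, h, N, 1)
        k_sq = (k ** 2).sum(-1).unsqueeze(-2)                    # (B, h, 1, N)
        dist = q_sq + k_sq - 2.0 * (q @ k.transpose(-2, -1))     # (B, h, N, N)
        attn = F.softmax(-dist / d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, h * d)
        return self.out_proj(out)
```

In the paper this attention is paired with a strengthened spectral normalization of the discriminator's linear layers; wrapping the projections above with torch.nn.utils.spectral_norm would approximate the standard form of that constraint, with the paper's additional rescaling omitted here for brevity.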
These methodological advances allow ViTGAN to match the performance of state-of-the-art CNN-based GANs like StyleGAN2 across established benchmarks such as CIFAR-10, CelebA, and LSUN bedroom datasets.
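On the generator side, the self-modulated layer normalization and the Fourier-feature patch decoder can be sketched compactly. The module names, dimensions, and defaults below (SelfModulatedLayerNorm, ImplicitPatchDecoder, num_freqs, patch_size) are hypothetical choices for illustration, not the paper's implementation.

```python
import math
import torch
import torch.nn as nn

class SelfModulatedLayerNorm(nn.Module):
    """LayerNorm whose gain and bias are predicted from the latent code w."""

    def __init__(self, dim, w_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(w_dim, dim)
        self.to_beta = nn.Linear(w_dim, dim)

    def forward(self, h, w):
        # h: (B, N, dim), w: (B, w_dim); the latent modulates every token.
        gamma = self.to_gamma(w).unsqueeze(1)
        beta = self.to_beta(w).unsqueeze(1)
        return gamma * self.norm(h) + beta


def fourier_features(coords, num_freqs=8):
    """Map 2-D pixel coordinates in [0, 1] to sinusoidal features."""
    freqs = 2.0 ** torch.arange(num_freqs, device=coords.device) * math.pi
    angles = coords.unsqueeze(-1) * freqs                    # (..., 2, num_freqs)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 2, 2*num_freqs)
    return feats.flatten(-2)                                 # (..., 4*num_freqs)


class ImplicitPatchDecoder(nn.Module):
    """Tiny MLP that renders a P x P patch from a patch embedding and coordinates."""

    def __init__(self, embed_dim, patch_size=4, num_freqs=8, hidden=128):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = embed_dim + 4 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, 3)
        )
        # Fixed grid of normalized pixel coordinates inside one patch.
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, patch_size),
            torch.linspace(0, 1, patch_size),
            indexing="ij",
        )
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1).view(-1, 2))

    def forward(self, patch_embed):
        # patch_embed: (B, N, embed_dim) -> pixels: (B, N, P*P, 3)
        B, N, _ = patch_embed.shape
        pos = fourier_features(self.coords, self.num_freqs)      # (P*P, 4F)
        pos = pos.expand(B, N, -1, -1)                            # (B, N, P*P, 4F)
        emb = patch_embed.unsqueeze(2).expand(-1, -1, pos.size(2), -1)
        return torch.tanh(self.mlp(torch.cat([emb, pos], dim=-1)))
```

A full image would then be assembled by reshaping the (B, N, P*P, 3) output back onto the spatial grid of patches; the transformer blocks producing patch_embed would use SelfModulatedLayerNorm in place of ordinary layer normalization, with w coming from a small mapping network.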
Experimental Results
Empirically, ViTGAN performs strongly. The experiments show that it outperforms other transformer-based GANs by a significant margin and achieves FID scores competitive with prevailing CNN-based models. In particular, ViTGAN reaches performance comparable to StyleGAN2 without employing convolutions or pooling, which underscores the potential of transformer architectures for high-quality image synthesis.
Implications and Future Prospects
The successful application of ViTs in GANs opens new prospects in neural network design, challenging the convolutional paradigm that has dominated computer vision. The findings suggest that transformers, even without convolution or pooling operations, can effectively model the generative process by exploiting the global context that self-attention provides.
The potential for future research is considerable. ViTGAN could catalyze further work on transformer-based architectures for other synthesis tasks, including video generation and larger-scale, higher-resolution image datasets. Developing regularization techniques tailored to transformers in generative models may further alleviate stability issues and improve performance.
In conclusion, ViTGAN signifies a promising expansion of the application space for transformers, pushing the boundaries of what is achievable in generative modeling. As the machine learning community continues to explore the intersections of various architectures, ViTGAN stands as a testament to the ongoing evolution and sophistication of deep learning models.