StyleSwin: Transformer-Based GAN for High-Resolution Image Generation
Overview
The paper "StyleSwin: Transformer-based GAN for High-resolution Image Generation" introduces a novel approach leveraging pure transformers to develop a generative adversarial network (GAN) for high-resolution image synthesis. Traditionally, convolutional networks (ConvNets) have dominated in image generative modeling, particularly with architectures like StyleGAN. However, this paper explores the potential of transformers, particularly the Swin transformer, to enhance high-resolution image generation by overcoming the computational challenges typically associated with transformers.
Methodology
The authors present several key innovations within the StyleSwin model:
- Local Attention with Swin Transformer: Swin-style window-based local attention balances computational efficiency with modeling capacity. Restricting self-attention to local windows avoids the quadratic cost of global attention, permitting scalability to resolutions as high as 1024x1024 pixels (a minimal sketch appears after this list).
- Double Attention Mechanism: To capture wider context without excessive computation, a double attention mechanism lets each block attend to regular and shifted windows in parallel, expanding the transformer's receptive field efficiently (covered in the same sketch below).
- Style-Based Architecture: Inspired by StyleGAN, StyleSwin adopts a style-based generator in which a learned latent (style) code modulates the feature maps at every scale, significantly enhancing generation quality (an illustrative modulation sketch follows the list).
- Local-Global Positional Encoding: The authors address the loss of positional awareness caused by window partitioning by combining sinusoidal positional encoding with relative positional encoding, helping the model leverage global position information effectively (sketched below).
- Wavelet Discriminator: To suppress blocking artifacts in high-resolution synthesis, a wavelet discriminator examines spectral discrepancies in the wavelet domain, guiding the generator toward artifact-free outputs (a toy version appears below).
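Below is a minimal PyTorch sketch of window-based double attention, assuming the window size divides the spatial dimensions and that half of the heads attend within regular windows while the other half attend within windows shifted by half the window size. The real StyleSwin block also applies style modulation and positional encodings (omitted here), and all module names and sizes are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into (num_windows * B, ws * ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(windows, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class DoubleWindowAttention(nn.Module):
    """Half of the heads attend inside regular windows, the other half inside
    windows cyclically shifted by ws // 2, so one pass covers a larger receptive field."""

    def __init__(self, dim, num_heads=8, ws=8):
        super().__init__()
        assert dim % 2 == 0 and num_heads % 2 == 0
        self.ws = ws
        self.heads = num_heads // 2              # heads per branch
        half = dim // 2
        self.qkv1 = nn.Linear(half, half * 3)    # projections for the regular branch
        self.qkv2 = nn.Linear(half, half * 3)    # projections for the shifted branch
        self.proj = nn.Linear(dim, dim)

    def _attend(self, x, qkv):
        # Plain scaled dot-product attention inside each window.
        Bw, N, C = x.shape
        h = self.heads
        q, k, v = qkv(x).reshape(Bw, N, 3, h, C // h).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * (C // h) ** -0.5
        return (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(Bw, N, C)

    def forward(self, x):
        # x: (B, H, W, C); channels are split between the two branches.
        B, H, W, C = x.shape
        x1, x2 = x.chunk(2, dim=-1)

        # Branch 1: regular windows.
        out1 = window_reverse(self._attend(window_partition(x1, self.ws), self.qkv1),
                              self.ws, H, W)

        # Branch 2: windows cyclically shifted by half the window size.
        s = self.ws // 2
        x2 = torch.roll(x2, shifts=(-s, -s), dims=(1, 2))
        out2 = window_reverse(self._attend(window_partition(x2, self.ws), self.qkv2),
                              self.ws, H, W)
        out2 = torch.roll(out2, shifts=(s, s), dims=(1, 2))

        return self.proj(torch.cat([out1, out2], dim=-1))


# Usage: a 32x32 feature map with 256 channels.
attn = DoubleWindowAttention(dim=256, num_heads=8, ws=8)
feats = torch.randn(2, 32, 32, 256)
print(attn(feats).shape)   # torch.Size([2, 32, 32, 256])
```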
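The style-based modulation can be illustrated roughly as follows: a StyleGAN-like mapping network turns a latent z into a style vector, and an adaptive normalization layer uses that vector to predict per-channel scale and shift for the transformer features. The paper studies several style-injection variants, so the adaptive layer-norm formulation, dimensions, and names below are assumptions for demonstration, not the exact scheme.

```python
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """z -> w, an MLP in the spirit of StyleGAN's mapping network."""
    def __init__(self, z_dim=512, w_dim=512, depth=8):
        super().__init__()
        layers = []
        for i in range(depth):
            layers += [nn.Linear(z_dim if i == 0 else w_dim, w_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


class AdaptiveNorm(nn.Module):
    """Layer norm whose affine parameters are predicted from the style vector."""
    def __init__(self, dim, w_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(w_dim, dim * 2)

    def forward(self, x, w):
        # x: (B, N, dim) token features, w: (B, w_dim) style code.
        scale, shift = self.to_scale_shift(w).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Usage: every block normalizes its tokens with the same style w,
# so the latent code steers feature statistics at all scales.
mapping = MappingNetwork()
ada = AdaptiveNorm(dim=256)
z = torch.randn(2, 512)
tokens = torch.randn(2, 64 * 64, 256)
print(ada(tokens, mapping(z)).shape)   # torch.Size([2, 4096, 256])
```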
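The local-global positional encoding combines two signals: a learned relative position bias inside each attention window (local) and a fixed sinusoidal encoding added to the feature map of every scale (global). The sketch below shows one plausible construction of each; the specific 2D sinusoidal layout is an assumption rather than the paper's exact formula.

```python
import torch
import torch.nn as nn


def sincos_2d(h, w, dim):
    """Fixed 2D sinusoidal encoding of shape (h * w, dim); half the channels
    encode the row index and half encode the column index."""
    assert dim % 4 == 0
    freqs = 1.0 / (10000 ** (torch.arange(0, dim // 4).float() / (dim // 4)))
    ys = torch.arange(h).float()[:, None] * freqs[None, :]          # (h, dim/4)
    xs = torch.arange(w).float()[:, None] * freqs[None, :]          # (w, dim/4)
    y = torch.cat([ys.sin(), ys.cos()], dim=-1)                     # (h, dim/2)
    x = torch.cat([xs.sin(), xs.cos()], dim=-1)                     # (w, dim/2)
    grid = torch.cat([y[:, None, :].expand(h, w, -1),
                      x[None, :, :].expand(h, w, -1)], dim=-1)      # (h, w, dim)
    return grid.reshape(h * w, dim)


class RelativePositionBias(nn.Module):
    """Learned bias indexed by the relative offset of two tokens within a window."""
    def __init__(self, ws, num_heads):
        super().__init__()
        self.table = nn.Parameter(torch.zeros((2 * ws - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(torch.arange(ws), torch.arange(ws),
                                            indexing="ij")).flatten(1)  # (2, ws*ws)
        rel = coords[:, :, None] - coords[:, None, :] + (ws - 1)        # (2, N, N)
        self.register_buffer("index", rel[0] * (2 * ws - 1) + rel[1])   # (N, N)

    def forward(self):
        # Returns (num_heads, N, N), added to the attention logits of each window.
        return self.table[self.index].permute(2, 0, 1)


# Usage: add the global encoding to token features and the local bias to logits.
pe = sincos_2d(32, 32, 256)                         # (1024, 256)
bias = RelativePositionBias(ws=8, num_heads=4)()    # (4, 64, 64)
print(pe.shape, bias.shape)
```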
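Finally, a toy illustration of the wavelet-discriminator idea: the image is decomposed with a one-level Haar transform into low- and high-frequency sub-bands, and a small critic scores those sub-bands, so high-frequency blocking artifacts become directly visible to the discriminator. The tiny critic head here is a placeholder for demonstration, not the paper's discriminator architecture.

```python
import torch
import torch.nn as nn


def haar_decompose(x):
    """One-level Haar transform: (B, C, H, W) -> (B, 4*C, H/2, W/2) with the
    LL, LH, HL, HH sub-bands concatenated along the channel dimension."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency approximation
    lh = (a + b - c - d) / 2   # horizontal detail
    hl = (a - b + c - d) / 2   # vertical detail
    hh = (a - b - c + d) / 2   # diagonal detail
    return torch.cat([ll, lh, hl, hh], dim=1)


class WaveletCritic(nn.Module):
    """Toy discriminator head that scores the wavelet sub-bands of an image."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4 * in_ch, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, img):
        return self.net(haar_decompose(img))


# Usage: score a batch of 256x256 images.
critic = WaveletCritic()
fake = torch.randn(4, 3, 256, 256)
print(critic(fake).shape)   # torch.Size([4, 1])
```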
Results
StyleSwin achieves results competitive with state-of-the-art ConvNet-based GANs, particularly on high-resolution benchmarks. Notably, on the CelebA-HQ dataset at 1024x1024 resolution, StyleSwin surpasses StyleGAN with an FID of 4.43. The method also performs competitively on other datasets such as FFHQ and LSUN Church, demonstrating its robustness across varied data.
Implications and Future Research
This research is a significant step toward high-resolution image generation with transformers. By building the generator from transformer blocks, the model gains expressivity and a stronger ability to capture long-range dependencies across large images. These findings open avenues for further research, especially in applying transformers to other generative tasks and in optimizing them for greater efficiency.
Future research could enhance the attention mechanisms to better balance locality with global coherence, and pursue architectural refinements that streamline transformer operations within GANs. Addressing the training dynamics and data requirements of transformers relative to ConvNets is another valuable direction.
The integration of more sophisticated discriminators to guide artifact-free synthesis while maintaining computational feasibility could further enhance the applicability and effectiveness of transformers in generative modeling.
In summary, "StyleSwin" presents a compelling step forward in leveraging transformer architectures for high-resolution image generation, offering promising results and laying the groundwork for ongoing advancements in the field of generative adversarial networks.