
Image Transformer (1802.05751v3)

Published 15 Feb 2018 in cs.CV

Abstract: Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art.

Authors (7)
  1. Niki Parmar (17 papers)
  2. Ashish Vaswani (23 papers)
  3. Jakob Uszkoreit (23 papers)
  4. Noam Shazeer (37 papers)
  5. Alexander Ku (15 papers)
  6. Dustin Tran (54 papers)
  7. Łukasz Kaiser (17 papers)
Citations (1,574)

Summary

The Image Transformer: A Technical Overview

The paper "Image Transformer" by Parmar et al. explores the extension of Transformer models, originally designed for sequence modeling tasks in NLP, to the domain of image generation. The authors propose an adaptation that leverages self-attention mechanisms with a focus on local neighborhoods, significantly elevating the capabilities of generative models in image synthesis and super-resolution tasks.

Background and Motivation

The research addresses limitations in existing autoregressive models such as PixelRNN and PixelCNN. While PixelRNN offers a robust framework for image generation by modeling each pixel’s distribution conditioned on previously generated pixels, its sequential nature incurs substantial computational costs. PixelCNN, though more parallelizable, suffers from a limited receptive field, often necessitating a higher number of parameters to capture long-range dependencies effectively. The Image Transformer mitigates such inefficiencies, achieving a larger receptive field without a proportional increase in parameter count.

Model Architecture

The core innovation lies in the use of local self-attention mechanisms tailored to image data. The model partitions the image into query blocks, each of which attends to a localized neighborhood, keeping computation tractable while preserving a large receptive field per layer. Because self-attention layers can be computed in parallel across positions, the model retains training efficiency comparable to CNNs while modeling long-range dependencies more directly than convolutions with small kernels, a capability usually associated with RNNs.
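The sketch below illustrates the windowing idea under stated assumptions: it partitions a feature map into non-overlapping query blocks and pads each with a surrounding halo of neighboring positions to form its memory block. The function name `extract_blocks` and the parameters `q_size` and `halo` are illustrative choices, not the paper's exact block geometry.

```python
import numpy as np

def extract_blocks(feature_map, q_size=8, halo=4):
    """Partition a (H, W, C) feature map into non-overlapping query blocks
    and pad each with a surrounding halo to form its local memory block.

    Illustrative only: the paper's exact 1D/2D block shapes, raster
    ordering, and masking of not-yet-generated pixels are omitted.
    """
    H, W, C = feature_map.shape
    padded = np.pad(feature_map, ((halo, halo), (halo, halo), (0, 0)))
    queries, memories = [], []
    for i in range(0, H, q_size):
        for j in range(0, W, q_size):
            # query block: the positions whose representations get updated
            queries.append(feature_map[i:i + q_size, j:j + q_size])
            # memory block: the query block plus its halo of neighbors
            memories.append(padded[i:i + q_size + 2 * halo,
                                   j:j + q_size + 2 * halo])
    return queries, memories
```

Because each query block attends only to its own memory block, the attention cost grows with the block and halo sizes rather than quadratically with the full image, which is what makes larger images tractable.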

Self-Attention Mechanism

The model applies multi-head self-attention in a restricted, localized manner. Each pixel's (or channel's) representation is computed by attending to a small neighborhood in the image, termed the memory block. The computation is a weighted sum over linearly transformed representations of the neighborhood pixels, akin to a gated convolution but with weights that depend dynamically on the content of the neighborhood.
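A minimal sketch of this restricted attention is given below. It assumes flattened query and memory blocks and hypothetical projection matrices `Wq`, `Wk`, `Wv`; positional encodings, causal masking of not-yet-generated pixels, residual connections, and layer normalization used in the actual model are omitted.

```python
import numpy as np

def local_attention(q_block, m_block, Wq, Wk, Wv, num_heads=8):
    """Multi-head scaled dot-product attention restricted to a local memory block.

    q_block: (nq, d) flattened query positions, d = model dimension
    m_block: (nm, d) flattened memory positions covering the queries' neighborhood
    Wq, Wk, Wv: (d, d) projection matrices
    """
    nq, d = q_block.shape
    nm, _ = m_block.shape
    dh = d // num_heads  # per-head dimension

    q = (q_block @ Wq).reshape(nq, num_heads, dh)
    k = (m_block @ Wk).reshape(nm, num_heads, dh)
    v = (m_block @ Wv).reshape(nm, num_heads, dh)

    # attention weights over the memory block, computed independently per head
    scores = np.einsum("qhd,mhd->hqm", q, k) / np.sqrt(dh)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # weighted sum of the transformed neighborhood representations
    out = np.einsum("hqm,mhd->qhd", weights, v)
    return out.reshape(nq, d)
```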

Loss Function

The generative model employs maximum likelihood estimation (MLE) for training, modeling pixel intensities via categorical distributions or discretized logistic mixture likelihoods (DMOL). The latter efficiently captures the ordinal nature of pixel values while reducing parameter count, enabling a denser and more effective gradient flow during optimization.
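As a minimal illustration of the training objective, the snippet below computes the average negative log-likelihood of observed intensities under a 256-way categorical output distribution, the simpler of the two parameterizations mentioned above; the DMOL variant replaces the softmax with a mixture of discretized logistics. The function name and shapes are assumptions made for the sketch.

```python
import numpy as np

def categorical_nll(logits, targets):
    """Average negative log-likelihood (in nats) of observed pixel intensities
    under a per-channel 256-way categorical distribution.

    logits:  (N, 256) unnormalized scores, one row per pixel channel
    targets: (N,) integer intensities in [0, 255]
    """
    # numerically stable log-softmax over the 256 intensity bins
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick out the log-probability assigned to each observed intensity
    return -log_probs[np.arange(len(targets)), targets].mean()
```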

Experiments and Results

The authors demonstrate the Image Transformer's efficacy across several tasks:

  • Unconditional Image Generation: On the CIFAR-10 and ImageNet datasets, the Image Transformer achieves superior performance compared to PixelCNN and PixelRNN, measured in bits per dimension (bpd; see the conversion sketch after this list). Notably, the model attains a new state of the art of 3.77 bpd on ImageNet, showcasing its robust generative capabilities.
  • Conditional Image Generation: When conditioned on class labels in CIFAR-10, the perceptual quality of generated images is significantly higher than that of unconditioned models, reflecting the model's ability to leverage conditional embeddings successfully.
  • Image Super-Resolution: In the challenging task of 4x super-resolution, the Image Transformer in an encoder-decoder setup performs impressively. On the CelebA dataset, human evaluators are fooled into believing generated images are real 36.11% of the time, which is a substantial improvement over previous methods.
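For reference, bits per dimension is the per-channel negative log-likelihood converted from nats to bits. A minimal conversion follows, with an illustrative input value chosen only to match the reported 3.77 bpd figure:

```python
import numpy as np

def bits_per_dim(nll_nats_per_channel):
    """Convert an average NLL in nats per color channel to bits per dimension."""
    return nll_nats_per_channel / np.log(2.0)

# illustrative: an average NLL of about 2.613 nats/channel corresponds to ~3.77 bpd
print(round(bits_per_dim(2.613), 2))  # 3.77
```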

Implications and Future Directions

The successful adaptation of self-attention to image generation opens new avenues for research. The Image Transformer not only excels in existing tasks like image generation and super-resolution but also holds potential for integrating diverse conditioning information, such as free-form text for tasks involving visual and textual data. The flexibility and efficiency of the self-attention mechanism make it a compelling candidate for video modeling and applications in model-based reinforcement learning.

Conclusion

The Image Transformer represents a significant contribution to the field of image generation by leveraging self-attention mechanisms within a localized context. Its ability to generalize across various tasks, coupled with improved efficiency and scalability, marks a noteworthy advancement over traditional CNN- and RNN-based approaches. Future research can build on this foundation to explore multi-modal learning and real-time video synthesis, pushing the boundaries of what is achievable with generative models in computer vision.
