Axial Attention in Multidimensional Transformers (1912.12180v1)

Published 20 Dec 2019 in cs.CV

Abstract: We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors. Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of implementation in order to decrease resource requirements. Our architecture, by contrast, maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation and achieving state-of-the-art results on standard generative modeling benchmarks. Our models are based on axial attention, a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. Notably the proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. This semi-parallel structure goes a long way to making decoding from even a very large Axial Transformer broadly applicable. We demonstrate state-of-the-art results for the Axial Transformer on the ImageNet-32 and ImageNet-64 image benchmarks as well as on the BAIR Robotic Pushing video benchmark. We open source the implementation of Axial Transformers.

Authors (4)
  1. Jonathan Ho (27 papers)
  2. Nal Kalchbrenner (27 papers)
  3. Dirk Weissenborn (17 papers)
  4. Tim Salimans (46 papers)
Citations (481)

Summary

  • The paper presents axial attention as its main contribution, reducing computational costs while preserving full distribution expressiveness in multidimensional transformers.
  • It introduces a novel architecture with unmasked and masked blocks that enables parallel computation and efficient semi-parallel sampling.
  • Experimental results on ImageNet and BAIR benchmarks confirm the model’s superior performance in both image and video generative tasks.

Axial Attention in Multidimensional Transformers

The paper introduces Axial Transformers, a self-attention-based autoregressive model designed for data structured as high-dimensional tensors, such as images and videos. The authors address the limitations of existing autoregressive models, which either demand excessive computational resources or compromise expressiveness or ease of implementation in order to reduce those demands.

Key Contributions

The core innovation in this work is the use of axial attention, which applies attention mechanisms along individual axes of a tensor rather than a flattened sequence of elements. This architectural refinement significantly reduces computational demands, resulting in a model that maintains full distribution expressiveness without imposing additional implementation burdens.
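To make the idea concrete, below is a minimal single-head sketch of attention along one axis of a 2-D feature map in plain NumPy. The shapes, the identity query/key/value projections, and the function names are simplifications for illustration, not the authors' released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, axis):
    """Single-head self-attention restricted to one axis of an (H, W, C) tensor.

    Learned query/key/value projections are omitted for brevity; each 1-D
    slice along `axis` is treated as an independent attention problem.
    """
    x_moved = np.moveaxis(x, axis, -2)                    # (..., L, C)
    q = k = v = x_moved                                   # identity projections
    logits = np.einsum('...ic,...jc->...ij', q, k) / np.sqrt(x.shape[-1])
    attn = softmax(logits, axis=-1)                       # (..., L, L)
    out = np.einsum('...ij,...jc->...ic', attn, v)        # (..., L, C)
    return np.moveaxis(out, -2, axis)

# Attending along rows, then along columns, propagates information across
# the full 2-D grid while each step only attends over sqrt(N) elements.
x = np.random.randn(32, 32, 64)        # 32x32 feature map, 64 channels
y = axial_attention(axial_attention(x, axis=1), axis=0)
assert y.shape == x.shape
```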

Architectural Design

  • Axial Attention: By attending along individual axes, axial attention offers computational and memory efficiencies, saving a factor of O(N^{(d-1)/d}) over traditional self-attention for d-dimensional tensors with N elements. For a square image with N = S^2 pixels (d = 2), this reduces the per-layer attention cost from O(N^2) to O(N^{3/2}).
  • Model Structure: The Axial Transformer architecture combines unmasked and masked attention blocks, enabling the model to capture dependencies across the entire input tensor while keeping most computation parallel.
  • Sampling Efficiency: The model supports a semi-parallel structure for sampling, significantly reducing the time needed to generate samples compared to previous methods; a schematic sketch of this loop follows the list.
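The semi-parallel pattern can be illustrated as follows. This is a schematic sketch only: `encode_prev_rows`, `row_decoder`, and `sample_categorical` are hypothetical placeholders standing in for the model's outer row-context layers and inner row decoder, not functions from the open-sourced code.

```python
import numpy as np

def sample_image(H, W, encode_prev_rows, row_decoder, sample_categorical):
    """Schematic semi-parallel sampling loop (all three callables are
    hypothetical placeholders, not the authors' API)."""
    img = np.zeros((H, W), dtype=np.int64)
    for i in range(H):
        # The expensive context over all previously sampled rows is computed
        # once per row, fully in parallel over pixels -- this is the
        # "semi-parallel" part that avoids rerunning the whole network
        # for every individual pixel.
        ctx = encode_prev_rows(img[:i])
        for j in range(W):
            # Only the lightweight inner row decoder runs sequentially,
            # conditioning on the row context and the pixels sampled so far.
            logits = row_decoder(ctx, img[i, :j])
            img[i, j] = sample_categorical(logits)
    return img
```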

Experimental Results

The Axial Transformers demonstrated state-of-the-art performance across various benchmarks:

  • ImageNet Performance: Achieved superior performance on ImageNet-32 and ImageNet-64 datasets, with results corroborated by high-quality and globally coherent image samples.
  • Video Modeling: On the BAIR Robotic Pushing benchmark, the model outperformed previous approaches, affirming its capability in modeling complex video datasets without architecture modifications designed specifically for video.

Implications and Future Directions

Axial Transformers offer a compelling balance between computational efficiency and modeling power, suggesting broad applicability across domains requiring high-dimensional data processing. Practically, the model’s reduction in computational resource requirements facilitates more accessible deployment on standard hardware platforms, such as GPUs and TPUs.

Theoretically, the introduction of axial attention provides a versatile framework that could inspire future research into multidimensional data processing architectures. Exploring modifications to axial attention, incorporating additional modalities, or extending the approach to other generative tasks could further broaden its applicability and improve performance.

Conclusion

This paper provides a detailed exploration of axial attention as a mechanism to optimize the performance of autoregressive models on multidimensional tensors. By integrating axial attention into the transformer framework, the authors address critical resource constraints without sacrificing the model's expressiveness or ease of implementation. This advancement opens avenues for both practical applications and future research developments in the domain of high-dimensional generative modeling.
