Introduction
Child et al. introduce several modifications to the Transformer architecture that target the time and memory limitations imposed by sequence length. The original Transformer, although effective across a wide range of sequence modeling tasks, incurs computational and memory costs that grow quadratically with sequence length, because every position attends to every other position. The modifications proposed by Child et al. center on sparse factorizations of the attention matrix, which reduce this cost to O(n√n), together with supporting architectural changes that make it practical to model sequences of substantial length.
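To make the scaling argument concrete, the short sketch below (illustrative Python written for this summary, not code from the paper) counts attention connections under a dense causal mask and under a strided sparse factorization with a stride near the square root of the sequence length; the function names and parameters are chosen here purely for illustration.

```python
import math

def dense_causal_connections(n):
    # Dense causal attention: position i attends to positions 0..i,
    # so the total number of connections is n*(n+1)/2 -- quadratic in n.
    return n * (n + 1) // 2

def strided_sparse_connections(n, stride):
    # Strided factorization: each position attends to the previous `stride`
    # positions (one head) plus every `stride`-th earlier position (a second
    # head), giving on the order of 2*stride connections per position.
    total = 0
    for i in range(n):
        local = set(range(max(0, i - stride + 1), i + 1))
        strided = {j for j in range(i + 1) if (i - j) % stride == 0}
        total += len(local | strided)
    return total

n = 4096
stride = math.isqrt(n)  # a stride near sqrt(n) yields on the order of n*sqrt(n) connections
print(dense_causal_connections(n))            # 8,390,656 connections (quadratic in n)
print(strided_sparse_connections(n, stride))  # 389,152 connections (order n*sqrt(n))
```

Even at n = 4096 the connection count drops by more than an order of magnitude, and the gap widens as sequences grow, which is the basic reason the sparse factorization scales to much longer inputs.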
Related Work and Background
The paper situates itself within the literature on neural autoregressive models, which excel at modeling the high-dimensional distributions found in text, audio, and image data. Although these models, including CNN-based architectures such as WaveNet, are highly capable, their success comes at a steep computational price. The Transformer of Vaswani et al., despite its strong performance on natural language tasks, is similarly resource-hungry when applied to long sequences. Child et al. focus on reducing the Transformer's memory and computational requirements for long sequences without sacrificing performance.
Sparse Transformer Architecture
The Sparse Transformer, as the authors name their architecture, incorporates several key elements: a restructured residual block and initialization scheme that allow very deep networks to be trained, sparse attention kernels for efficient computation, and recomputation of attention weights during the backward pass to save memory. Together, these changes make it possible to train on sequences tens of thousands of timesteps long. The paper also introduces factorized attention patterns that split full attention across two heads, which is particularly convenient when the data has a two-dimensional structure, such as images or audio waveforms; a sketch of such a pattern appears below.
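As a rough illustration of the two-dimensional factorized pattern, the sketch below builds boolean masks for the strided variant in which one head attends to a local window and a second head attends to every stride-th earlier position. This is a hypothetical reconstruction for intuition only, not the authors' fused kernels; the function name `strided_masks` and the toy sizes are chosen here for the example.

```python
import numpy as np

def strided_masks(n, stride):
    """Boolean attention masks for a two-head strided factorization.

    Head A ("row"): each position attends to the previous `stride` positions.
    Head B ("column"): each position attends to every `stride`-th earlier position.
    Stacked over two layers, the heads connect any pair of positions j <= i.
    """
    i = np.arange(n)[:, None]  # query positions (rows of the mask)
    j = np.arange(n)[None, :]  # key positions (columns of the mask)
    causal = j <= i
    row_mask = causal & (i - j < stride)         # local window along a "row"
    col_mask = causal & ((i - j) % stride == 0)  # strided "column" positions
    return row_mask, col_mask

# When the sequence is a flattened image, setting `stride` to the image width
# makes the row head attend within an image row and the column head attend
# down an image column.
row_mask, col_mask = strided_masks(n=16, stride=4)
print(row_mask.astype(int))
print(col_mask.astype(int))
```

For data without an obvious two-dimensional layout, such as text, the paper instead uses a "fixed" pattern in which designated summary positions are attended to by all later positions; the same masking idea applies with a different choice of index sets.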
Experiments and Results
The authors validate the architecture empirically across several data types, including natural language (Enwik8), raw audio (classical music), and images (CIFAR-10 and downsampled ImageNet). The Sparse Transformers achieve state-of-the-art results on these density modeling benchmarks and generate samples with global coherence over very long sequences. Notably, models equipped with sparse attention patterns not only reduce computation but also tend to converge to lower loss than comparable dense-attention models, which may indicate a useful inductive bias or point to optimization difficulties inherent to dense attention.
Conclusion
Child et al.'s Sparse Transformers represent a significant advance in efficiently modeling long sequences, delivering state-of-the-art results across multiple data types. They address core issues in the Transformer's scalability and demonstrate that self-attention can handle sequences of over a million timesteps. These results mark a meaningful step in large-scale sequence modeling and pave the way for future work on even longer and more complex data.