A Practical Survey on Faster and Lighter Transformers (2103.14636v2)

Published 26 Mar 2021 in cs.LG

Abstract: Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrary long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models' efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice in order to meet the desired trade-off between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.

An Expert Overview of "A Practical Survey on Faster and Lighter Transformers"

The paper authored by Quentin Fournier, Gaétan Marceau Caron, and Daniel Aloise provides a thorough investigation into the advancements aimed at optimizing the computational efficiency of Transformer models. The original Transformer model, introduced by Vaswani et al., revolutionized sequence-to-sequence tasks due to its ability to capture long-term dependencies using a self-attention mechanism. Despite its efficacy, the quadratic complexity of Transformers with respect to the sequence length presents a significant bottleneck, hindering scalability and practicality in resource-constrained environments.

Key Contributions and Analysis

  1. Transformer Foundations and Challenges: The paper opens by delineating the evolution from Recurrent Neural Networks (RNNs) to Transformers, explaining how the attention mechanism overcomes the RNNs' difficulty in capturing long-range dependencies. The cost of this improvement, however, is a computational and memory complexity that grows quadratically with the sequence length, leading to substantial resource demands as sequences grow longer.
  2. General Methods for Efficiency: Several general techniques to reduce both computational burden and memory requirements are discussed:

- Gradient Checkpointing & Reversible Layers: These techniques trade computation for memory by discarding intermediate activations and re-computing them during the backward pass (see the first sketch after this list).

- Parameter Sharing: Reusing the same weights across layers reduces memory, though it slightly diminishes model capacity because there are fewer free parameters (sketched below).

- Pruning and Knowledge Distillation: Both focus on reducing model size after training: pruning removes the least salient weights, while knowledge distillation trains a smaller student model to emulate a larger teacher (both are sketched below).

- Mixed Precision and Quantization: By lowering the precision of weights and arithmetic operations, these techniques substantially reduce memory usage and accelerate computation, particularly on hardware optimized for low-precision operations (sketched below).
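The following is a minimal PyTorch sketch of gradient checkpointing, not code from the survey; the layer sizes and the use of `nn.TransformerEncoderLayer` are illustrative assumptions. Each layer's activations are discarded in the forward pass and recomputed during the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedEncoder(nn.Module):
    """Stack of Transformer layers whose activations are recomputed in the
    backward pass instead of being stored (trading compute for memory)."""
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            # Only the layer inputs are kept; intermediate activations
            # inside `layer` are recomputed when gradients are needed.
            x = checkpoint(layer, x, use_reentrant=False)
        return x

x = torch.randn(2, 1024, 512, requires_grad=True)
out = CheckpointedEncoder()(x)
out.sum().backward()  # recomputation happens here
```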
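Parameter sharing can be illustrated with a short sketch in the spirit of ALBERT-style cross-layer sharing (an assumption of this sketch, not a method prescribed by the survey): one set of layer weights is applied at every depth.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Cross-layer parameter sharing: a single encoder layer is reused at
    every depth, dividing the parameter count by roughly `num_layers`."""
    def __init__(self, d_model=512, nhead=8, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # same weights at every depth
        return x
```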
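For pruning, a hedged sketch using PyTorch's built-in magnitude-pruning utilities; magnitude is only one of several possible criteria and the 30% ratio is arbitrary:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured magnitude pruning: zero out the 30% of weights with the
# smallest absolute value, then make the sparsity permanent.
layer = nn.Linear(512, 512)
prune.l1_unstructured(layer, name="weight", amount=0.3)  # attaches a binary mask
prune.remove(layer, "weight")                            # folds the mask into the weights
```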
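Knowledge distillation is commonly implemented with a loss of the following form; the temperature and mixing weight below are illustrative defaults, not values taken from the survey:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual cross-entropy on hard labels with a KL term that pushes
    the student's softened distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    return alpha * hard + (1.0 - alpha) * soft
```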
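Mixed precision is typically a few-line change on GPUs; the sketch below uses PyTorch's automatic mixed precision and assumes a CUDA device and an arbitrary stand-in loss:

```python
import torch
import torch.nn as nn

model = nn.TransformerEncoderLayer(512, 8, batch_first=True).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()      # scales the loss to avoid fp16 underflow

x = torch.randn(8, 1024, 512, device="cuda")
with torch.cuda.amp.autocast():           # matrix multiplies run in reduced precision
    loss = model(x).pow(2).mean()         # stand-in loss for illustration
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```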

  3. Specialized Approaches to Attention: Since the attention mechanism is the primary bottleneck, specialized approaches optimize it directly:

- Sparse Attention: By restricting each position to a subset of the possible attention connections, models such as the Longformer and BigBird achieve linear complexity, which is critical for long-sequence processing (a sliding-window mask is sketched below).

- Factorized Attention: Techniques such as the Linformer reduce the attention complexity by projecting the keys and values to a fixed, lower dimension, effectively computing a low-rank approximation of the attention matrix (sketched below).

- Architectural Modifications: Some models, such as the Compressive Transformer, introduce additional structures, such as a compressed memory of past activations, to extend the attention span without incurring the full quadratic cost.
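A sliding-window pattern of the kind used in local sparse attention can be illustrated with a simple mask. Note that this sketch still materialises the full score matrix, so it only shows the pattern; dedicated kernels (as in the Longformer) are needed to realise the linear memory cost.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks pairs that may NOT attend to each other.
    Each position only sees neighbours within `window` steps, so the number
    of scored pairs grows as O(seq_len * window) rather than O(seq_len**2)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > window

mask = sliding_window_mask(seq_len=1024, window=128)
# Can be passed as `attn_mask` to torch.nn.MultiheadAttention (True = blocked).
```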
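The Linformer's low-rank idea can be sketched as follows for a single unbatched head; the projection matrices `E` and `Fp` and all shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def linformer_style_attention(queries, keys, values, E, Fp):
    """Keys and values are projected from sequence length n down to k before
    the softmax, so the score matrix is n x k instead of n x n."""
    keys_low = E @ keys                                    # (k, d)
    values_low = Fp @ values                               # (k, d)
    scores = queries @ keys_low.transpose(-1, -2)          # (n, k)
    attn = F.softmax(scores / queries.shape[-1] ** 0.5, dim=-1)
    return attn @ values_low                               # (n, d)

n, d, k = 1024, 64, 128                                    # k << n gives O(n * k) cost
q, kv = torch.randn(n, d), torch.randn(n, d)
E, Fp = torch.randn(k, n) / n ** 0.5, torch.randn(k, n) / n ** 0.5
out = linformer_style_attention(q, kv, kv, E, Fp)          # shape (n, d)
```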

Implications and Future Directions

The research encapsulated in this survey is crucial for expanding the applicability of Transformers across various domains requiring long-sequence processing, such as genomics and high-resolution computer vision tasks. While strides have been made, the authors highlight the necessity for:

  • Unified Performance Benchmarks: The authors call for standardized benchmarks to compare efficiency-oriented techniques objectively, since reported improvements can be highly task-dependent.
  • Hardware-Aware Sparsity Patterns: Adapting models to leverage hardware capabilities can significantly improve their practical efficiency.
  • Understanding Self-Attention Dynamics: Achieving more interpretable models and understanding the fundamental aspects of attention can lead to more targeted and theoretically grounded improvements.

In essence, the paper provides a foundation for researchers looking to both deepen their understanding of current optimization techniques and explore new approaches that balance performance with scalability. This comprehensive examination underlines the importance of efficiency not just for expanding the Transformer’s capabilities but also for democratizing access to powerful machine learning tools in economically constrained settings.

Authors (3)
  1. Quentin Fournier (14 papers)
  2. Gaétan Marceau Caron (5 papers)
  3. Daniel Aloise (11 papers)
Citations (80)