An Overview of "A Survey of Transformers" by Tianyang Lin et al.
The paper "A Survey of Transformers" authored by Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu, presents a comprehensive review of Transformer models and their variants. Initially proposed for machine translation, the Transformer architecture has evolved to become a cornerstone in diverse artificial intelligence fields, including NLP, computer vision (CV), and audio processing. This paper systematically categorizes and analyzes the extensive range of Transformer variants, termed X-formers, to provide insights into their architectural modifications, pre-training methodologies, and applications.
Architectural Modifications of Transformers
Overview of Vanilla Transformer
The vanilla Transformer model comprises an encoder-decoder structure, with each side consisting of a stack of identical layers featuring multi-head self-attention and position-wise feed-forward networks (FFNs). Key components such as the attention modules, residual connections, layer normalization, and position encodings enable the model to capture dependencies across tokens effectively.
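To make these components concrete, here is a minimal, self-contained PyTorch sketch of scaled dot-product attention and a single post-LN encoder layer (attention and FFN sublayers, each wrapped in a residual connection and layer normalization). The class name, dimensions, and hyperparameters are illustrative assumptions, not the paper's reference implementation.

```python
import math
import torch
import torch.nn as nn

def dot_product_attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (..., L_q, L_k)
    return torch.softmax(scores, dim=-1) @ v                  # (..., L_q, d_v)

class EncoderLayer(nn.Module):
    """Illustrative vanilla encoder layer: multi-head self-attention + position-wise FFN,
    each followed by a residual connection and layer normalization (post-LN ordering)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                                     # x: (batch, seq_len, d_model)
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        return x
```

A quick smoke test such as `EncoderLayer()(torch.randn(2, 16, 512))` returns a tensor of the same shape, reflecting how layers of this kind are stacked to form the encoder.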
Enhancements in Attention Mechanism
Given the quadratic complexity of self-attention with respect to sequence length, several enhancements have been proposed to improve efficiency and model generalization:
- Sparse Attention: Sparse attention mechanisms, whether position-based or content-based, reduce computational complexity by limiting the number of query-key pairs that are scored. Techniques like Star-Transformer, Longformer, ETC, and BigBird employ a combination of band (local window) and global attention patterns (a minimal masking sketch follows this list).
- Linearized Attention: Linearized attention approaches, such as Linear Transformer and Performer, approximate the attention matrix with kernel feature maps, reducing complexity from quadratic to linear in sequence length (a kernel feature-map sketch also follows this list).
- Prototype and Memory Compression: Methods like Clustered Attention and Linformer reduce the number of queries or key-value pairs to achieve more efficient attention computation.
- Low-Rank Self-Attention: Exploiting the low-rank property of attention matrices, works like CSALR and Nyströmformer leverage low-rank approximations to enhance efficiency.
- Attention with Prior: Integrating prior knowledge into attention mechanisms, approaches like Gaussian Transformer and Predictive Attention Transformer enhance attention distribution by incorporating structural biases or previous attention maps.
- Improved Multi-Head Mechanism: Head behavior modeling and refined aggregation techniques, such as dynamic routing and collaborative multi-head attention, improve the effectiveness of multi-head attention mechanisms.
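To illustrate the sparse-attention item above: models such as Longformer and BigBird restrict each query to a local sliding window ("band") plus a handful of global tokens. The sketch below builds such a boolean mask and applies it before the softmax; the window size and global positions are arbitrary assumptions, and this toy version still materializes the full L×L score matrix, whereas the real implementations compute only the allowed blocks to obtain the advertised savings.

```python
import torch

def band_global_mask(seq_len, window=2, global_positions=(0,)):
    """Boolean mask (True = attention allowed): sliding-window band plus global tokens."""
    idx = torch.arange(seq_len)
    mask = (idx[:, None] - idx[None, :]).abs() <= window      # local band
    for g in global_positions:
        mask[g, :] = True                                      # global token attends everywhere
        mask[:, g] = True                                      # ...and everyone attends to it
    return mask

def masked_attention(q, k, v, mask):
    """q, k, v: (seq_len, d); mask: (seq_len, seq_len)."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))          # forbid masked-out pairs
    return torch.softmax(scores, dim=-1) @ v
```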
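The linearized-attention item can likewise be made concrete. The sketch below uses the elu(x) + 1 feature map from Linear Transformer (Performer instead uses random features) to rewrite softmax(QK^T)V as phi(Q)(phi(K)^T V) with a running normalizer, so the cost grows linearly rather than quadratically with sequence length. It is an unbatched, single-head schematic, not any paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: q, k of shape (L, d_k), v of shape (L, d_v)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1        # positive feature map phi(x) = elu(x) + 1
    kv = k.transpose(0, 1) @ v               # (d_k, d_v): keys/values summarized once
    z = k.sum(dim=0)                         # (d_k,): normalizer accumulator
    out = (q @ kv) / (q @ z).unsqueeze(-1).clamp_min(eps)
    return out                               # (L, d_v), computed without an L x L matrix
```

Because `kv` and `z` can be updated incrementally, the same trick also enables constant-memory autoregressive decoding, which is part of these models' appeal.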
Other Module-Level Modifications
Beyond the attention mechanism, significant modifications have been proposed for other components of the Transformer model:
- Position Representations: Methods like relative position encoding and implicit position representations (e.g., the rotary embeddings of RoFormer) enhance the positional awareness of the model.
- Layer Normalization: Placement variants like pre-LN and substitutes like PowerNorm address the stability and efficiency issues associated with the original post-LN layer normalization (a pre-LN sketch follows this list).
- Position-wise FFN: Enhancements in FFNs include exploring alternative activation functions (e.g., Swish, GELU), scaling capacity via mixture-of-experts (e.g., Switch Transformer), and even approaches that drop FFNs entirely, relying on augmented attention instead.
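To illustrate the pre-LN placement and the alternative FFN activations mentioned above, the following sketch normalizes the input before each sublayer so that the residual path remains an identity mapping, the property usually credited with more stable training than the original post-LN ordering, and swaps ReLU for GELU in the FFN. Names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN ordering: x + sublayer(LayerNorm(x)), vs. post-LN's LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # GELU in place of the vanilla ReLU; Swish (nn.SiLU) would be a one-line swap.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                                    # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual path untouched
        x = x + self.ffn(self.norm2(x))
        return x
```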
Architecture-Level Variants
Researchers have proposed modifications at a higher architectural level to further enhance Transformer models:
- Lightweight Transformer Variants: Models such as Lite Transformer, Funnel Transformer, and DeLighT aim to reduce computation and memory overheads while maintaining or improving performance.
- Enhanced Cross-Block Connectivity: Approaches like RealFormer and Transparent Attention facilitate better information flow across deeper layers and between encoder-decoder blocks, addressing optimization challenges.
- Adaptive Computation Time: Techniques such as Universal Transformer and early exit mechanisms enable dynamic computation conditioned on input complexity, improving efficiency and handling varying input difficulties.
- Divide-and-Conquer Strategies: Recurrent Transformers (e.g., Transformer-XL, Compressive Transformer) and hierarchical Transformers (e.g., HIBERT, TNT) manage long sequences effectively by breaking inputs into manageable segments, as sketched below for segment-level recurrence.
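As referenced in the divide-and-conquer item, recurrent Transformers process a long input segment by segment while letting each segment attend to cached states from the previous one. The sketch below captures only the caching idea, concatenating a detached memory to the current keys and values; it omits Transformer-XL's relative position scheme and other details, and the shapes are unbatched for readability.

```python
import torch

def segment_attention(q, k, v, memory_kv=None):
    """Attend over the current segment plus an optional cache of past keys/values.

    q, k, v: (seg_len, d); memory_kv: optional (mem_k, mem_v) from the previous segment.
    """
    if memory_kv is not None:
        mem_k, mem_v = memory_kv
        k = torch.cat([mem_k, k], dim=0)      # keys:   [memory; current segment]
        v = torch.cat([mem_v, v], dim=0)      # values: [memory; current segment]
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    out = torch.softmax(scores, dim=-1) @ v
    # Cache the current segment's keys/values without gradients for the next segment.
    new_memory = (k[-q.size(0):].detach(), v[-q.size(0):].detach())
    return out, new_memory
```

Calling this in a loop over consecutive segments and threading `new_memory` back in as `memory_kv` gives each segment an effective context longer than the segment itself.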
Pre-Training and Applications
Transformer-based pre-trained models (PTMs) have been pivotal in extending the usability of Transformers across various tasks:
- Pre-Training Approaches: Encoder-only models (e.g., BERT), decoder-only models (e.g., GPT-3), and encoder-decoder models (e.g., T5) have demonstrated significant improvements on downstream NLP tasks through extensive self-supervised pre-training on large corpora (a masked-token sketch follows this list).
- Applications Across Domains: Transformers have been successfully applied in NLP, CV, audio processing, and multimodal applications, illustrating their versatility and efficacy.
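To make the encoder-only pre-training objective above concrete, the snippet below sketches BERT-style masked language modeling: roughly 15% of token positions are selected, replaced by a [MASK] id, and the model is trained to recover the original ids at exactly those positions. The function and argument names are illustrative; BERT's additional 80/10/10 split between [MASK], random, and unchanged tokens, and the exclusion of special tokens, are omitted for brevity.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Return (corrupted_ids, labels) for one masked-language-modeling step."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob    # positions the model must predict
    labels[~selected] = ignore_index                       # compute loss only on masked positions
    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id                    # simplest variant: always use [MASK]
    return corrupted, labels
```

The corrupted ids are fed to the encoder, and a cross-entropy loss with `ignore_index` compares the output logits against `labels`, so unmasked positions contribute nothing to the objective.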
Implications and Future Directions
Practical Implications
The advancements in Transformer architectures have significantly impacted practical applications across various AI fields. Efficient models with enhanced generalization abilities are crucial for deploying Transformers in resource-constrained environments. Furthermore, adaptations for multimodal data integration open possibilities for developing unified frameworks capable of handling diverse data types.
Theoretical Implications
The theoretical understanding of the Transformer's capacity and effectiveness remains an open research area. Addressing this gap could provide deeper insight into the mechanisms that allow Transformer models to outperform traditional architectures.
Speculative Future Developments
Future research may explore:
- Theoretical Analyses: Investigating the theoretical underpinnings of Transformer models to explain their superior performance comprehensively.
- Novel Interaction Mechanisms: Developing new or hybrid mechanisms for global information propagation beyond self-attention.
- Unified Multimodal Frameworks: Creating frameworks capable of seamless multimodal data integration, leveraging the Transformer's flexibility.
In conclusion, the paper provides a thorough examination of the Transformer architecture's evolution, highlighting key improvements and their implications. Continued innovation and theoretical advancements will likely further cement the Transformer's role as a fundamental model in AI research and applications.