Analysis of Decoupled Positional Attention in Transformers
This paper presents an in-depth exploration of positional encoding in Transformer models, challenging conventional approaches and offering a novel method termed Decoupled posItional attEntion for Transformers (DIET). Because self-attention layers are permutation-equivariant, traditional Transformers such as BERT require positional embeddings to make the model sensitive to the order of the input sequence. Conventionally, these embeddings have been absolute, fixed-length encodings added directly to the token embeddings at the input layer. More recent work favors relative position encodings, which have delivered noticeable gains in model performance. The paper systematically investigates the mechanisms behind these gains and introduces a simpler, computationally efficient positional encoding strategy that achieves competitive results across a range of NLP benchmarks.
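As a point of reference, here is a minimal sketch of the conventional input-additive baseline described above. It is written in NumPy; all sizes, names, and values are illustrative assumptions, not taken from any particular implementation.

```python
# Minimal sketch of the conventional baseline: a learned absolute position
# embedding is added to the token embedding before the first Transformer
# layer. All sizes here are illustrative.
import numpy as np

vocab_size, max_len, d_model = 1000, 128, 64
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(vocab_size, d_model))   # token embedding table
pos_emb = rng.normal(size=(max_len, d_model))        # absolute position table

token_ids = rng.integers(0, vocab_size, size=16)     # a toy input sequence
x = token_emb[token_ids] + pos_emb[:len(token_ids)]  # additive input encoding
# x then feeds the self-attention stack; content and position are entangled
# in the same d_model-dimensional vectors from the first layer onward.
print(x.shape)  # (16, 64)
```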
Core Contributions
The key contributions of this research include:
- Theoretical Examination: A rigorous theoretical analysis exposes the limitations of adding position embeddings at the input layer. The authors argue that this coupling of content and position constrains the rank of the attention matrices the model can express, limiting its representational power. Through detailed proofs, they show that injecting positional information directly into each head's attention scores admits attention matrices of significantly higher rank, and thus greater representational capacity (a toy rank illustration appears after this list).
- Proposed Positional Encoding Methods: The paper introduces two innovative encoding techniques:
- Diet-Abs: A per-head absolute position attention method that decouples position embeddings from input tokens, allowing for high-rank attention matrices.
- Diet-Rel: A simplified variant of relative positional attention that likewise injects positional information per head and incorporates efficient segment encodings within the attention layer.
Both methods avoid the complexity and computational overhead associated with previous approaches, offering faster training and inference. A minimal per-head sketch of both variants follows this list.
- Practical Insights and Implications: By moving positional information from the input layer into per-head attention, the paper demonstrates performance gains across several established benchmarks, including GLUE and XTREME. Notably, Diet-Abs improves markedly despite using absolute positions, which have traditionally been considered weaker than relative methods.
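To make the rank argument in the Theoretical Examination bullet concrete, here is a toy numerical illustration under a deliberately simplified parameterization of our own (not the paper's exact construction): with positions added at the input, one head's attention-logit matrix cannot exceed the head dimension in rank, while decoupled per-head position scores can go beyond it.

```python
# Toy rank comparison: input-additive position embeddings vs. decoupled
# per-head positional scores. Shapes and parameterizations are illustrative.
import numpy as np

n, d_model, d_head = 32, 16, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))               # token (content) representations
P = rng.normal(size=(n, d_model))               # absolute position embeddings
Wq, Wk = rng.normal(size=(2, d_model, d_head))  # content projections
Uq, Uk = rng.normal(size=(2, d_model, d_head))  # separate position projections

# Baseline: positions added at the input, so one bilinear form carries both
# content and position; its rank is capped by the head dimension.
coupled = (X + P) @ Wq @ Wk.T @ (X + P).T
print(np.linalg.matrix_rank(coupled))           # <= d_head (4 here)

# Decoupled, Diet-Abs-style idea: content and position each contribute their
# own per-head score, summed inside the attention layer.
decoupled = X @ Wq @ Wk.T @ X.T + P @ Uq @ Uk.T @ P.T
print(np.linalg.matrix_rank(decoupled))         # up to 2 * d_head (8 here)

# Diet-Rel-style idea: one learned scalar per relative offset forms a
# Toeplitz score matrix whose rank is not tied to d_head at all.
b = rng.normal(size=2 * n - 1)
rel_bias = b[np.arange(n)[None, :] - np.arange(n)[:, None] + n - 1]
print(np.linalg.matrix_rank(rel_bias))          # typically full rank (n = 32)
```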
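And here is a minimal single-head forward pass sketching how the decoupled variants can be wired: positional scores are computed separately per head and added to the content scores before the softmax. The function names, the scaling choice, and the exact parameterizations of the Diet-Abs-style and Diet-Rel-style terms are our own illustrative assumptions, not the authors' implementation.

```python
# Single-head sketch of decoupled positional attention: an (n, n) positional
# score matrix is built separately and added to the content scores.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_attention_head(X, Wq, Wk, Wv, pos_scores):
    """X: (n, d_model) token representations; pos_scores: (n, n) per-head term."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = (Q @ K.T + pos_scores) / np.sqrt(Q.shape[-1])  # content + position
    return softmax(logits) @ V

n, d_model, d_head = 8, 16, 4
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = rng.normal(size=(3, d_model, d_head))

# Diet-Abs flavor: absolute position embeddings with their own projections,
# contributing an (n, n) score matrix that never touches the token embeddings.
P = rng.normal(size=(n, d_model))
Uq, Uk = rng.normal(size=(2, d_model, d_head))
abs_scores = (P @ Uq) @ (P @ Uk).T

# Diet-Rel flavor: one learned scalar per relative offset j - i, i.e. the
# Toeplitz score matrix from the previous sketch.
b = rng.normal(size=2 * n - 1)
rel_scores = b[np.arange(n)[None, :] - np.arange(n)[:, None] + n - 1]

print(decoupled_attention_head(X, Wq, Wk, Wv, abs_scores).shape)  # (8, 4)
print(decoupled_attention_head(X, Wq, Wk, Wv, rel_scores).shape)  # (8, 4)
```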
Experimental Validation
The paper provides compelling experimental evidence supporting the proposed approaches:
- GLUE Benchmark: On English transfer learning tasks, both Diet-Rel and Diet-Abs outperform a baseline BERT model that adds position embeddings at the input. Moving segment features into per-head attention yields further gains, corroborating the theoretical insights.
- XTREME Benchmark: In multilingual settings, Diet-Abs shows clear improvements, surpassing previous state-of-the-art methods. The experiments also highlight a regularization effect from sharing positional parameters across layers and heads, which stabilizes and strengthens absolute positional encoding (a sketch of this sharing idea follows this list).
- Machine Translation: On machine translation tasks, the per-head positional methods consistently outperformed traditional input-level embeddings.
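The cross-layer and cross-head sharing mentioned in the XTREME bullet can be pictured with a short sketch. The choice of a per-head relative-offset table and the parameter counts below are our own illustrative assumptions; the point is only that one positional table, reused by every layer, adds no positional parameters as depth grows and thereby acts as a regularizer.

```python
# Sketch of sharing positional-attention parameters across layers: a single
# per-head table of relative-offset biases is created once and reused by
# every layer instead of being learned layer by layer.
import numpy as np

n, num_layers, num_heads = 128, 12, 12
rng = np.random.default_rng(2)

# One bias per (head, relative offset), shared by all layers.
shared_rel_bias = rng.normal(size=(num_heads, 2 * n - 1))

def position_scores():
    # Every layer calls this and indexes the same shared table.
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # j - i
    return shared_rel_bias[:, offsets + n - 1]                # (heads, n, n)

shared = shared_rel_bias.size      # positional parameters with sharing
unshared = num_layers * shared     # positional parameters if learned per layer
print(position_scores().shape)     # (12, 128, 128)
print(shared, unshared)            # 3060 vs. 36720
```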
Future Directions
The paper opens avenues for further investigation into:
- Long-range Transformers: Considering the efficiency gains demonstrated, future research might focus on integrating DIET methods with long-sequence models, exploring their scalability and adaptability across varied input lengths.
- Cross-layer Parameter Sharing: Exploring more advanced sharing mechanisms and their impact on model capacity could yield Transformer architectures with smaller computational footprints and undiminished expressive power.
Ultimately, this research challenges established paradigms of position encoding in Transformers, offering analytically sound and empirically validated alternatives that are likely to influence future developments in model design.