Transformer Language Models without Positional Encodings Still Learn Positional Information

Published 30 Mar 2022 in cs.CL, cs.AI, and cs.LG | (2203.16634v2)

Abstract: Causal transformer LMs, such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.

Citations (92)

Summary

  • The paper demonstrates that NoPos transformer models inherently acquire positional information through their causal attention mechanism, achieving near-parity with traditional models.
  • The study compares various positional encoding techniques (sinusoidal, learned, and ALiBi) across benchmarks like WikiText-103 and the Pile, showing that the gap to NoPos shrinks as model size grows, while sequence length affects which explicit encoding works best.
  • Experimental probing confirms that NoPos models depend on implicit positional cues for accurate token predictions, challenging the necessity of explicit positional encodings.

Exploring Transformer LLMs' Implicit Positional Learning Capabilities

Introduction to NoPos LLMs

Recent research shows that causal transformer LMs, which traditionally rely on positional encodings to discern the order of input tokens, remain remarkably competitive even when those encodings are removed. Across a range of datasets, model sizes, and sequence lengths, these so-called No Positional Encoding (NoPos) models hold their ground against conventionally engineered LMs and, moreover, appear to acquire an implicit notion of absolute position on their own.

Methodology

The study compares NoPos models against models equipped with conventional positional encoding mechanisms, including sinusoidal encodings, learned embeddings, and Attention with Linear Biases (ALiBi), on benchmarks such as WikiText-103 and the Pile. Surprisingly, the NoPos models show only small performance gaps relative to models with explicit positional cues, suggesting that the transformer architecture can adapt to sequential data without being handed token positions.
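For concreteness, the sketch below shows, in PyTorch, the kinds of positional mechanisms being compared: fixed sinusoidal embeddings, learned absolute-position embeddings, and ALiBi's linear attention biases; the NoPos setting simply omits all of them. This is an illustrative reimplementation rather than the authors' code, and the function names and slope formula are assumptions.

```python
# Illustrative PyTorch sketch of the positional mechanisms compared in the paper.
import math
import torch
import torch.nn as nn


def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos embeddings added to the token embeddings (assumes even d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
    enc = torch.zeros(seq_len, d_model)
    enc[:, 0::2] = torch.sin(pos * div)
    enc[:, 1::2] = torch.cos(pos * div)
    return enc


def learned_encoding(max_len: int, d_model: int) -> nn.Embedding:
    """Trainable absolute-position embeddings, as in GPT-2/GPT-3-style models."""
    return nn.Embedding(max_len, d_model)


def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """ALiBi: a per-head linear penalty on attention scores by query-key distance."""
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... (exact for power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    distance = torch.arange(seq_len).view(1, -1) - torch.arange(seq_len).view(-1, 1)
    return slopes.view(n_heads, 1, 1) * distance.clamp(max=0)  # (n_heads, query, key), added to scores


# The NoPos variant omits all of the above: token embeddings enter the causal
# transformer with no positional signal added anywhere in the network.
```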

In-depth Analysis

To dissect the origins of positional awareness in NoPos models, the study employs probing classifiers that predict token positions from the representations at different network depths. The findings are striking: NoPos models attain positional prediction accuracy nearly on par with their positionally informed counterparts. The authors attribute this to the causal attention mechanism, which restricts each token's attention to its predecessors and thereby leaks information about sequence order. Consistent with this account, the effect does not carry over to models that lack a directional attention constraint, such as bidirectional models trained with masked language modeling (MLM).
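A minimal version of such a probe might look like the sketch below, assuming hidden states have already been extracted from one layer of a frozen model; the tensor shapes, the choice of a plain linear classifier, and the training loop are illustrative assumptions rather than the paper's exact protocol.

```python
import torch
import torch.nn as nn


def train_position_probe(hidden_states: torch.Tensor, epochs: int = 10, lr: float = 1e-3):
    """Train a linear probe to predict each token's absolute position.

    hidden_states: (n_seqs, seq_len, d_model) activations taken from one
    layer of a frozen language model.
    """
    n_seqs, seq_len, d_model = hidden_states.shape
    feats = hidden_states.reshape(-1, d_model).detach()   # keep the LM frozen
    labels = torch.arange(seq_len).repeat(n_seqs)         # target = absolute position 0..seq_len-1
    probe = nn.Linear(d_model, seq_len)                   # one output class per position
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(feats), labels)
        loss.backward()
        opt.step()
    acc = (probe(feats).argmax(dim=-1) == labels).float().mean().item()
    return probe, acc  # training-set accuracy; a held-out split would be used in practice
```

Comparing such probe accuracies across layers, and between NoPos and standard models, is what reveals where positional information emerges inside the network.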

Scaling Effects and Sequence Length Dynamics

Further experiments examine the effect of scaling model parameters and varying input sequence length. Smaller NoPos models lag somewhat behind their positionally encoded peers, but this gap narrows noticeably as model size increases. Meanwhile, the advantage of specific positional encoding techniques such as ALiBi becomes more pronounced as sequences grow longer, underlining the interplay between model size, sequence length, and the efficacy of positional encoding.

NoPos Models' Positional Information Utilization

A pivotal part of the study asks how and why NoPos models use the implicit positional information they acquire. By shuffling the token order of input sequences and measuring the resulting change in perplexity, the authors show that NoPos models do rely on this positional awareness to make accurate predictions: perplexity degrades when word order is destroyed, ruling out the possibility that the models are simply order-invariant.
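The shuffling check can be reproduced in a few lines. The sketch below uses an off-the-shelf GPT-2 checkpoint from Hugging Face Transformers purely as a stand-in, since the paper's NoPos checkpoints are not assumed to be available, and the example sentence is made up.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, input_ids: torch.Tensor) -> float:
    """Perplexity of a (1, seq_len) token batch under a causal LM."""
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()


tokenizer = AutoTokenizer.from_pretrained("gpt2")             # stand-in checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "Language models without positional encodings still learn word order."
ids = tokenizer(text, return_tensors="pt").input_ids
shuffled = ids[:, torch.randperm(ids.shape[1])]               # destroy the token order

print("original perplexity:", perplexity(model, ids))
print("shuffled perplexity:", perplexity(model, shuffled))    # much higher if order is used
```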

Hypotheses and Future Directions

The paper conjectures that the causal attention mechanism inherently lets NoPos transformer models gauge a token's position by, in effect, counting the number of predecessors it can attend to. This insight opens intriguing avenues for further exploration, particularly in how transformers can be designed or trained to strengthen this implicit positional learning. Moreover, the performance gap observed between NoPos causal LMs and their bidirectional MLM counterparts raises questions about the broader implications of implicit positional learning across different transformer applications.
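This conjecture can be illustrated with a toy computation: under a causal mask, the output at position t aggregates over t + 1 values, so its statistics depend on t even when the inputs carry no positional signal at all. The sketch below, which uses uniform attention weights and random inputs, is purely illustrative and not drawn from the paper.

```python
import torch

torch.manual_seed(0)
seq_len, d_model, n_trials = 16, 64, 2000

# Order-free inputs: i.i.d. random vectors with no positional signal of any kind.
x = torch.randn(n_trials, seq_len, d_model)

# Uniform causal attention: position t simply averages over tokens 0..t.
mask = torch.tril(torch.ones(seq_len, seq_len))
weights = mask / mask.sum(dim=-1, keepdim=True)   # row t holds t + 1 equal weights
out = weights @ x                                  # (n_trials, seq_len, d_model)

# The norm of the averaged output shrinks roughly like 1/sqrt(t + 1), so later
# positions are statistically distinguishable from earlier ones.
print(out.norm(dim=-1).mean(dim=0))
```

In a trained model the attention weights are of course not uniform, but the same counting effect gives the network a usable handle on absolute position.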

Theoretical and Practical Implications

This study sheds light on the ability of transformer architectures to learn positional information implicitly, challenging the common assumption that explicit positional encodings are necessary. The practical implications are notable, offering a fresh perspective on model design and potentially easing the overhead associated with positional embeddings. On the theoretical front, the findings invite a reevaluation of how sequence order is represented and learned in neural networks.

Concluding Remarks

The investigation of NoPos transformer LMs shows that they learn sequence positions implicitly, recalibrating our understanding of the role positional encoding plays in transformer architectures. As research on language modeling continues to grapple with these architectural subtleties, the findings of this study offer a promising direction for rethinking and streamlining model design.
