Rethinking Positional Encoding in Language Pre-training (2006.15595v4)

Published 28 Jun 2020 in cs.CL and cs.LG

Abstract: In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness in the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol [CLS] the same as other words is a reasonable design, considering its special role (the representation of the entire sentence) in the downstream tasks. Motivated from above analysis, we propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and positional correlation separately with different parameterizations and then adds them together. This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the [CLS] symbol from other positions, making it easier to capture information from all positions. Extensive experiments and ablation studies on GLUE benchmark demonstrate the effectiveness of the proposed method. Codes and models are released at https://github.com/guolinke/TUPE.

An Overview of "Rethinking Positional Encoding in Language Pre-training"

Ke et al. investigate the positional encoding methods used in language pre-training models such as BERT, challenge the existing formulations, and propose a new method termed Transformer with Untied Positional Encoding (TUPE). The paper critically analyzes both absolute and relative positional encoding in common LLMs, identifying key limitations that constrain expressiveness and efficiency.

Critique of Existing Positional Encoding

The authors begin by examining the absolute positional encoding used in Transformers. In traditional settings, positional embeddings are added to the word embeddings at the input stage, which mixes two heterogeneous sources of information and introduces noisy correlations between positional information and word semantics. These cross terms add unnecessary randomness to the attention scores and limit the expressiveness of the attention module. Furthermore, Ke et al. question whether the [CLS] token, a special token whose representation stands in for the whole sentence in downstream tasks, should be treated like an ordinary word: because attention shaped by positional encoding tends to favor nearby positions, tying [CLS] to its position can bias it towards the first few words and harm whole-sentence comprehension.
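Following the paper's analysis (with word embedding w_i, positional embedding p_i, and shared projection matrices W^Q and W^K; scaling omitted), expanding the standard attention score makes the problem explicit: only the first and last terms correlate homogeneous quantities, while the two middle terms mix word and positional embeddings and contribute mostly noise:

\bigl((w_i + p_i) W^Q\bigr)\bigl((w_j + p_j) W^K\bigr)^{\top}
  = \underbrace{w_i W^Q (W^K)^{\top} w_j^{\top}}_{\text{word-to-word}}
  + \underbrace{w_i W^Q (W^K)^{\top} p_j^{\top} + p_i W^Q (W^K)^{\top} w_j^{\top}}_{\text{mixed word--position terms}}
  + \underbrace{p_i W^Q (W^K)^{\top} p_j^{\top}}_{\text{position-to-position}}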

Introduction of TUPE

To address these shortcomings, TUPE computes the word contextual correlation and the positional correlation separately in the self-attention module, using distinct parameterizations, and then adds the two terms together. By untying the two correlations with different projection matrices, TUPE removes the unintended interactions between positional and word embeddings while gaining expressiveness from the separate parameterizations.
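A minimal single-head PyTorch sketch of this untied score computation is given below; the module and tensor names are illustrative rather than the paper's API, and the released code at https://github.com/guolinke/TUPE remains the authoritative implementation.

# Minimal sketch of TUPE-style untied attention scores (single head, illustrative only).
import math
import torch
import torch.nn as nn

class UntiedAttentionScores(nn.Module):
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        # Separate projections for word (contextual) and positional correlations.
        self.word_q = nn.Linear(d_model, d_model, bias=False)
        self.word_k = nn.Linear(d_model, d_model, bias=False)
        self.pos_q = nn.Linear(d_model, d_model, bias=False)
        self.pos_k = nn.Linear(d_model, d_model, bias=False)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) word embeddings WITHOUT positions added in.
        batch, seq_len, _ = x.shape
        pos = self.pos_emb(torch.arange(seq_len, device=x.device))  # (seq_len, d_model)

        # Word-to-word correlation, computed from content only.
        word_scores = self.word_q(x) @ self.word_k(x).transpose(-1, -2)
        # Position-to-position correlation, shared across the batch.
        pos_scores = self.pos_q(pos) @ self.pos_k(pos).transpose(-1, -2)

        # Each term is scaled by 1/sqrt(2*d) so the summed score keeps roughly
        # the variance of a standard attention score; no mixed terms appear.
        scale = 1.0 / math.sqrt(2 * self.d_model)
        return scale * word_scores + scale * pos_scores.unsqueeze(0)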

Additionally, TUPE unties the [CLS] token from the other positions, replacing its positional correlations with dedicated learnable values so that it can better capture global context across the sentence. This modification aims to ensure that the [CLS] token integrates information from all positions rather than being biased towards the first words.
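Continuing the sketch above, untying [CLS] can be approximated by overwriting the positional correlations in the row and column belonging to [CLS] (assumed here to sit at position 0) with learnable scalars, so that its attention pattern is no longer tied to a particular absolute position; the theta names below are illustrative, not taken from the released code.

# Illustrative continuation of the sketch: untie [CLS] (position 0) from positions.
# In practice the two theta values would be module parameters learned per attention head.
theta_others_to_cls = nn.Parameter(torch.zeros(1))  # any token attending to [CLS]
theta_cls_to_others = nn.Parameter(torch.zeros(1))  # [CLS] attending to any token

def untie_cls(pos_scores: torch.Tensor) -> torch.Tensor:
    # pos_scores: (seq_len, seq_len) positional correlations before scaling.
    pos_scores = pos_scores.clone()
    pos_scores[:, 0] = theta_others_to_cls  # column: [CLS] as key
    pos_scores[0, :] = theta_cls_to_others  # row: [CLS] as query (also sets [0, 0])
    return pos_scores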

Experimental Validation

Extensive experiments on the GLUE benchmark showcase TUPE's effectiveness over existing methods. TUPE improves performance on nearly all tasks, with notable gains on MNLI, CoLA, and RTE. The paper also reports that TUPE can match or exceed the baselines in significantly fewer pre-training steps, suggesting faster convergence and more efficient pre-training.

Implications and Future Work

The methodology presented by TUPE contributes both to the theoretical understanding and to the practical implementation of positional encoding in LLMs. By untying positional information from word semantics and treating the [CLS] token separately, TUPE informs the design of future models aiming for improved robustness and efficiency in capturing sequence information.

Future research may explore further revisions to positional encoding strategies and investigate their integration into other LLMs or even multimodal architectures. The scalability and transferability of TUPE in diverse natural language processing environments could be focal points for subsequent studies.

In conclusion, this paper makes a substantive contribution to the ongoing discourse on enhancing LLMs by revisiting foundational components such as positional encodings, offering insights that could guide both theoretical explorations and applied enhancements in machine learning methodologies.

Authors (3)
  1. Guolin Ke (43 papers)
  2. Di He (108 papers)
  3. Tie-Yan Liu (242 papers)
Citations (273)