RoFormer: Enhanced Transformer with Rotary Position Embedding (2104.09864v5)

Published 20 Apr 2021 in cs.CL, cs.AI, and cs.LG

Abstract: Position encoding has recently proven effective in the transformer architecture. It provides valuable supervision for modeling dependencies between elements at different positions of a sequence. In this paper, we first investigate various methods of integrating positional information into the learning process of transformer-based LLMs. We then propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix while incorporating explicit relative position dependency into the self-attention formulation. Notably, RoPE offers valuable properties, including flexibility with respect to sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, called RoFormer, on various long-text classification benchmark datasets. Our experiments show that it consistently outperforms its alternatives. Furthermore, we provide a theoretical analysis to explain some of the experimental results. RoFormer is already integrated into Huggingface: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.

Introduction

In the domain of NLP, a fundamental component of state-of-the-art models is their ability to understand the order and relative positions of words or tokens within input sequences. This understanding is crucial as it allows models to interpret and predict language with greater accuracy. Traditionally, various methods have been crafted to imbue models with this positional awareness.

Rotary Position Embedding

The paper presents a novel strategy for integrating position information into LLMs, termed Rotary Position Embedding (RoPE). The method distinguishes itself by encoding absolute position through a rotation matrix: each token's query and key vectors in self-attention are rotated by an angle determined by the token's position, so positional information is embedded directly into the representations from which attention scores are computed. Key benefits of RoPE include its flexibility with respect to sequence length and an inter-token dependency that decays as the relative distance between tokens grows, a property that mirrors natural language, where the relevance between words typically diminishes as the distance between them increases.
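
To make the mechanism concrete, below is a minimal NumPy sketch of applying a rotary embedding to a single query or key vector. The function name rotary_embed and the use of NumPy are illustrative assumptions, not the authors' reference implementation; the frequency schedule with base 10000 follows the paper's standard formulation.

```python
import numpy as np

def rotary_embed(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to its position.

    x        : 1-D query or key vector of even dimension d.
    position : integer token index m.
    The pair (x[2i], x[2i+1]) is rotated by m * theta_i, with
    theta_i = base ** (-2i / d), i.e. the block-diagonal rotation applied to x.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    theta = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    angles = position * theta                   # (d/2,) rotation angles m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin   # rotate each 2-D feature pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# The score between rotated vectors depends only on the relative offset n - m:
q, k = np.random.randn(8), np.random.randn(8)
score_a = rotary_embed(q, 3) @ rotary_embed(k, 7)     # positions (3, 7), offset 4
score_b = rotary_embed(q, 10) @ rotary_embed(k, 14)   # positions (10, 14), offset 4
assert np.isclose(score_a, score_b)
```

The final assertion illustrates the central property: the dot product of a rotated query and key depends on their absolute positions only through their relative offset.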

Evaluation and Contributions

The RoPE-integrated transformer model, referred to as RoFormer, was empirically evaluated on a variety of benchmark datasets for long text classification and was found to outperform baseline models. There are three primary contributions from this research:

  • It offers a new perspective on exploiting positional information in LLMs: a rotational encoding in which absolute positions are applied via rotation matrices and relative positions emerge from the product of the rotated context representations (see the identity sketched after this list).
  • The properties of RoPE are studied, highlighting its beneficial decay with increased relative distances, making it suitable for natural language encoding.
  • The resulting RoFormer model, when evaluated, demonstrates superior performance over existing models on various benchmark datasets, showing the practical effectiveness of this novel encoding technique.
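
As a sketch of the identity behind the first contribution, using the paper's notation as it is commonly written (with x_m, x_n the token representations, W_q, W_k the query and key projections, and R_{\Theta,m} the rotation matrix for position m; the exact symbols here are an assumption, not a verbatim quotation):

```latex
\[
q_m^{\top} k_n
  = \big(R_{\Theta,m} W_q x_m\big)^{\top} \big(R_{\Theta,n} W_k x_n\big)
  = \big(W_q x_m\big)^{\top} R_{\Theta,\,n-m} \big(W_k x_n\big)
\]
% using R_{\Theta,m}^{\top} R_{\Theta,n} = R_{\Theta, n-m} for rotation matrices.
```

The absolute positions m and n thus enter the attention score only through their difference, which is exactly the relative position dependency described above.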

Theoretical Foundation and Limitations

An underlying theoretical explanation is provided, showing how relative position encoding can be interpreted as rotations in a two-dimensional plane and generalized to higher dimensions. Despite its demonstrated effectiveness and promise, the authors acknowledge limitations, including the lack of a comprehensive explanation for why RoPE leads to faster convergence in training than other positional embedding methods. Likewise, RoFormer's superior handling of long text is not yet fully understood. The paper indicates that further investigation into these aspects is necessary.
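
To sketch that two-dimensional picture (the formulas below follow the paper's standard formulation but are reproduced from memory rather than quoted): a 2D query at position m is rotated by the angle m\theta, and the full d-dimensional transform stacks d/2 such rotations with geometrically spaced frequencies.

```latex
% 2D case: the (projected) query at position m is rotated by m * theta.
\[
f(\mathbf{q}, m) =
\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}
\begin{pmatrix} q_1 \\ q_2 \end{pmatrix}
\]

% General case: a block-diagonal stack of d/2 independent 2D rotations,
% with frequencies theta_i = 10000^{-2(i-1)/d}, i = 1, ..., d/2.
\[
R_{\Theta,m} = \operatorname{diag}\!\big( R(m\theta_1),\, R(m\theta_2),\, \ldots,\, R(m\theta_{d/2}) \big)
\]
```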

Conclusion

In sum, the introduction of RoPE marks a clear step forward for transformer-based LLMs. By using a rotational mechanism for position encoding, RoFormer shows potential to improve efficiency and effectiveness across a spectrum of NLP tasks, particularly those dealing with extended text. The technique adds a valuable option to the repertoire available for enhancing the language understanding capabilities of AI models.

Authors (6)
  1. Jianlin Su (31 papers)
  2. Yu Lu (146 papers)
  3. Shengfeng Pan (8 papers)
  4. Ahmed Murtadha (4 papers)
  5. Bo Wen (40 papers)
  6. Yunfeng Liu (19 papers)
Citations (1,579)