RoFormer: Transformer with Rotary Position Embedding
- The paper introduces rotary position embedding (RoPE), which replaces additive positional encodings with deterministic, position-dependent rotations of the queries and keys in 2D subspaces, jointly encoding absolute and relative positions.
- RoFormer achieves sequence-length flexibility by computing rotations on-the-fly, eliminating the need for fixed lookup tables and seamlessly integrating with linear attention mechanisms.
- Empirical benchmarks demonstrate that RoFormer converges faster and improves performance on tasks like machine translation and long-text classification compared to traditional Transformer models.
RoFormer is an enhanced Transformer architecture that incorporates rotary position embedding (RoPE) to encode absolute and relative position information jointly and efficiently within the self-attention mechanism. RoPE replaces additive or learned absolute positional encodings with a method that rotates each 2-dimensional subspace of the projected queries and keys by an angle proportional to the token position, thus providing a framework that is both theoretically well-justified and practically flexible for a broad class of Transformer-based models (Su et al., 2021).
1. Derivation of Rotary Position Embedding
The objective of RoPE is to define position-aware query and key functions $f_q(x_m, m)$ and $f_k(x_n, n)$ such that their inner product is a function solely of the content and the relative position $m - n$, i.e.,

$$\langle f_q(x_m, m),\, f_k(x_n, n)\rangle = g(x_m, x_n, m - n).$$

In two dimensions, this is achieved by identifying $\mathbb{R}^2$ with the complex plane and representing the embedding via rotation:

$$f_q(x_m, m) = (W_q x_m)\, e^{i m \theta}, \qquad f_k(x_n, n) = (W_k x_n)\, e^{i n \theta}.$$

The real part of the inner product,

$$g(x_m, x_n, m - n) = \operatorname{Re}\!\left[(W_q x_m)(W_k x_n)^{*}\, e^{i (m - n)\theta}\right],$$

ensures explicit dependence on the relative position.

For general even dimension $d$, the space is decomposed into $d/2$ orthogonal 2-dimensional subspaces, each parameterized with its own frequency $\theta_i$. The full rotation is encoded with the block-diagonal matrix

$$R^{d}_{\Theta, m} = \operatorname{diag}\!\big(R(m\theta_1), \ldots, R(m\theta_{d/2})\big),$$

where each block is

$$R(m\theta_i) = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}, \qquad \theta_i = 10000^{-2(i-1)/d}.$$

The choice of $\theta_i$ mirrors the original sinusoidal positional frequencies (Su et al., 2021).

The position-encoded queries and keys are then

$$q_m = R^{d}_{\Theta, m} W_q x_m, \qquad k_n = R^{d}_{\Theta, n} W_k x_n,$$

leading to attention scores computed as

$$q_m^{\top} k_n = x_m^{\top} W_q^{\top} R^{d}_{\Theta, n - m} W_k x_n,$$

hence the attention depends explicitly on the relative position $n - m$.
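In practice the block-diagonal matrix is never materialized; the rotation is applied with element-wise operations on paired coordinates. The NumPy sketch below is a minimal illustration of this, assuming the interleaved pairing of dimensions and the frequency schedule above (function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def rope_angles(positions, d):
    """Rotation angles m * theta_i, shape (len(positions), d // 2)."""
    i = np.arange(d // 2)
    theta = 10000.0 ** (-2.0 * i / d)      # theta_i = 10000^(-2(i-1)/d), 0-indexed here
    return np.outer(positions, theta)

def apply_rope(x, positions):
    """Rotate each consecutive pair (x_{2i-1}, x_{2i}) of x by the angle m * theta_i."""
    d = x.shape[-1]
    ang = rope_angles(positions, d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]    # coordinates of the d/2 two-dimensional subspaces
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2x2 rotation applied block-wise
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: rotate projected queries and keys for an 8-token sequence.
rng = np.random.default_rng(0)
seq_len, d = 8, 16
q = rng.normal(size=(seq_len, d))          # stand-in for W_q x_m
k = rng.normal(size=(seq_len, d))          # stand-in for W_k x_n
positions = np.arange(seq_len)
q_rot, k_rot = apply_rope(q, positions), apply_rope(k, positions)
scores = q_rot @ k_rot.T                   # pre-softmax attention logits
```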
2. Relative Position Dependency in Self-Attention
Standard scaled dot-product attention with absolute position embeddings forms $q_m = W_q(x_m + p_m)$ and $k_n = W_k(x_n + p_n)$, so the score mixes content and absolute-position terms and does not permit explicit control over relative positions. In contrast, with RoPE,

$$q_m = R^{d}_{\Theta, m} W_q x_m, \qquad k_n = R^{d}_{\Theta, n} W_k x_n,$$

so

$$q_m^{\top} k_n = \left(R^{d}_{\Theta, m} W_q x_m\right)^{\top} \left(R^{d}_{\Theta, n} W_k x_n\right) = x_m^{\top} W_q^{\top} R^{d}_{\Theta, n - m} W_k x_n.$$

This form directly incorporates the relative rotation $R^{d}_{\Theta, n - m} = \big(R^{d}_{\Theta, m}\big)^{\top} R^{d}_{\Theta, n}$.
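A quick numerical sanity check (reusing `apply_rope`, `q`, and `k` from the sketch in Section 1; the positions m = 3, n = 7 are arbitrary) confirms that the score depends on the offset n − m alone:

```python
# Rotating q by m and k by n gives the same logit as leaving q unrotated
# and rotating k by the relative offset n - m.
m, n = 3, 7
q_m = apply_rope(q[m:m+1], np.array([m]))[0]
k_n = apply_rope(k[n:n+1], np.array([n]))[0]
k_rel = apply_rope(k[n:n+1], np.array([n - m]))[0]
assert np.allclose(q_m @ k_n, q[m] @ k_rel)
```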
Additionally, the magnitude of the attention score decays with increasing relative distance $|m - n|$. Specifically, writing the score in the complex form

$$g(x_m, x_n, m - n) = \operatorname{Re}\!\left[\sum_{j=1}^{d/2} h_j\, e^{i (m - n)\theta_j}\right], \qquad h_j = q_{[2j-1:2j]}\, k^{*}_{[2j-1:2j]},$$

an Abel summation argument with the partial sums $S_l = \sum_{j=1}^{l-1} e^{i (m - n)\theta_j}$ bounds the magnitude by $\big(\max_j |h_{j+1} - h_j|\big) \sum_{j=1}^{d/2} |S_{j+1}|$ and formalizes the decay property: under the chosen frequency schedule $\theta_j = 10000^{-2(j-1)/d}$, the relative bound $\frac{2}{d}\sum_j |S_j|$ decreases as $|m - n|$ increases, biasing attention toward closer tokens (Su et al., 2021).
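The decay can be illustrated numerically. The sketch below (the dimension 64 and the listed distances are arbitrary choices) evaluates the relative upper bound $\frac{2}{d}\sum_j |S_j|$ at a few relative distances and shows that it generally shrinks as the distance grows:

```python
import numpy as np

dim = 64
freqs = 10000.0 ** (-2.0 * np.arange(dim // 2) / dim)   # theta_j

def relative_bound(rel_dist):
    """(2/d) * sum_j |S_j|, with S_j the partial sums of exp(i * rel_dist * theta_j)."""
    partial_sums = np.cumsum(np.exp(1j * rel_dist * freqs))
    return np.mean(np.abs(partial_sums))

for r in (1, 16, 64, 256):
    print(r, round(relative_bound(r), 2))                # bound generally decays with r
```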
3. Sequence-Length Flexibility and Compatibility with Linear Attention
RoPE does not rely on a fixed-size lookup table of position embeddings; it computes the rotation for any position directly. As a result, positions beyond the maximum sequence length seen during training can be encoded without modifying model parameters or retraining, giving flexible sequence-length support.
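Concretely, the `apply_rope` sketch from Section 1 works unchanged for positions far beyond any training-time maximum, since the angles are computed on the fly (position 10,000 below is an arbitrary example):

```python
# No lookup table: the same function handles arbitrary positions.
far_positions = np.array([0, 512, 10_000])
q_far = apply_rope(rng.normal(size=(3, d)), far_positions)   # shape (3, d), finite values
```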
Rotary embeddings are compatible with kernel-based (linear) attention variants. For an attention kernel using non-negative feature maps $\phi(\cdot)$ and $\varphi(\cdot)$:

$$\operatorname{Attention}(Q, K, V)_m = \frac{\sum_{n} \phi(q_m)^{\top} \varphi(k_n)\, v_n}{\sum_{n} \phi(q_m)^{\top} \varphi(k_n)},$$

the rotation is applied to the feature-mapped queries and keys in the numerator only:

$$\operatorname{Attention}(Q, K, V)_m = \frac{\sum_{n} \big(R^{d}_{\Theta, m}\, \phi(q_m)\big)^{\top} \big(R^{d}_{\Theta, n}\, \varphi(k_n)\big)\, v_n}{\sum_{n} \phi(q_m)^{\top} \varphi(k_n)},$$

leaving the denominator unrotated so that the normalizer stays positive. This preserves $O(N)$ time and memory complexity in the sequence length $N$, maintaining linear scalability for long sequences (Su et al., 2021).
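A minimal sketch under these assumptions (non-causal attention, NumPy, reusing `apply_rope`, `q`, `k`, and `positions` from Section 1; the `elu(x) + 1` feature map follows the linear-transformer literature and is not prescribed by the RoFormer paper):

```python
def feature_map(x):
    """elu(x) + 1: a non-negative feature map commonly used in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_rope(q, k, v, positions):
    qf, kf = feature_map(q), feature_map(k)
    qr, kr = apply_rope(qf, positions), apply_rope(kf, positions)   # rotate numerator terms only
    numerator = qr @ (kr.T @ v)                # O(N * d^2) instead of O(N^2 * d)
    normalizer = qf @ kf.sum(axis=0)           # denominator left unrotated (stays positive)
    return numerator / normalizer[:, None]

values = rng.normal(size=(seq_len, d))
out = linear_attention_rope(q, k, values, positions)   # shape (seq_len, d)
```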
4. Theoretical Properties
Several theoretical aspects underpin RoPE:
- 2D Uniqueness: In the 2D case, any formulation of $f_q, f_k$ whose inner product depends solely on the content and the relative position $m - n$ can, under the paper's initial conditions, be written in the rotation form above (see Appendix A of (Su et al., 2021)).
- Long-Term Decay: The mean cross-term amplitude decreases with $|m - n|$ for the selected $\theta_i = 10000^{-2(i-1)/d}$, enforcing a preference for short-range attention.
- Norm Preservation and Stability: The rotation matrices $R^{d}_{\Theta, m}$ are orthogonal, ensuring $\lVert R^{d}_{\Theta, m} x\rVert = \lVert x\rVert$ and preventing accumulation of numerical errors across layers; the short check after this list illustrates this numerically.
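A short check of the norm-preservation property, reusing `apply_rope` and `rng` from the sketch in Section 1 (the positions tested are arbitrary):

```python
# Orthogonal block rotations leave vector norms unchanged at every position.
x = rng.normal(size=(1, d))
for pos in (0, 5, 1000):
    rotated = apply_rope(x, np.array([pos]))
    assert np.isclose(np.linalg.norm(rotated), np.linalg.norm(x))
```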
5. Empirical Benchmarks
RoFormer models, employing RoPE, have been assessed on multiple language processing benchmarks:
| Task | Baseline | Metric | RoFormer Result |
|---|---|---|---|
| Machine Translation (WMT’14 En→De) | Transformer | BLEU | 27.5 (vs 27.3 baseline) |
| BERT Pre-training (MLM Loss) | BERT-base | Training Loss | Faster convergence |
| GLUE (6 tasks) | BERT-base | F1 / Spearman ρ | Outperforms BERT on 3 of 6 tasks (MRPC, STS-B, QQP) |
| Performer (Enwik8 char-level LM) | Performer | LM Loss | Faster convergence, lower LM loss |
| Long-Text Classif. (CAIL2019-SCM) | BERT/WoBERT | Accuracy | RoFormer-512: 68.29%; RoFormer-1024: 69.79% |
RoFormer provided consistent empirical gains in long-document classification, converged faster than sinusoidal alternatives in pre-training, and achieved higher or comparable accuracy across various evaluation settings. Specifically, on the Chinese long-text benchmark CAIL2019-SCM, RoFormer-1024 improved accuracy by about 1.5 percentage points over WoBERT-512 (Su et al., 2021).
6. Implementation and Integration
RoPE is integrated into popular frameworks such as Hugging Face Transformers, reflecting its plug-and-play nature in both quadratic and linear attention settings. The required model changes are limited to applying the rotation to the projected queries and keys before the attention dot product; no new parameters are introduced and no architectural changes or notable memory costs are incurred beyond this embedding transformation (Su et al., 2021).
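As an illustration of how small the change is, the sketch below wires the `apply_rope` function from Section 1 into an otherwise standard softmax attention; only the two marked lines are RoPE-specific (the random projection matrices stand in for learned weights, and the code is not taken from any particular library):

```python
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_rope(x, W_q, W_k, W_v):
    pos = np.arange(x.shape[0])
    q = apply_rope(x @ W_q, pos)      # RoPE-specific: rotate projected queries
    k = apply_rope(x @ W_k, pos)      # RoPE-specific: rotate projected keys
    v = x @ W_v                       # values are left untouched
    logits = (q @ k.T) / np.sqrt(W_q.shape[1])
    return softmax(logits) @ v

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
y = attention_with_rope(rng.normal(size=(seq_len, d)), W_q, W_k, W_v)   # (seq_len, d)
```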