RoFormer: Enhanced Transformer with Rotary Position Embedding

Updated 20 November 2025
  • The paper introduces rotary position embedding (RoPE), which replaces additive positional encodings with position-dependent rotations in 2D subspaces to jointly encode absolute and relative positions.
  • RoFormer achieves sequence-length flexibility by computing rotations on the fly, eliminating fixed lookup tables and integrating naturally with linear attention mechanisms.
  • Empirical benchmarks demonstrate that RoFormer converges faster and improves performance on tasks like machine translation and long-text classification compared to traditional Transformer models.

RoFormer is an enhanced Transformer architecture that incorporates rotary position embedding (RoPE) to encode absolute and relative position information jointly and efficiently within the self-attention mechanism. RoPE replaces additive or fixed positional encodings with position-dependent rotations applied in each 2-dimensional coordinate subspace, providing a framework that is both theoretically well-justified and practically flexible for a broad class of Transformer-based models (Su et al., 2021).

1. Derivation of Rotary Position Embedding

The objective of RoPE is to define position-aware queries $q_m = f_q(x_m, m)$ and keys $k_n = f_k(x_n, n)$ such that their inner product is a function solely of the content ($x_m, x_n$) and the relative position ($m-n$), i.e., $\langle f_q(x_m, m), f_k(x_n, n)\rangle = g(x_m, x_n, m-n)$.

In two dimensions, this is achieved by identifying $\mathbb{R}^2 \cong \mathbb{C}$ and representing the embedding via rotation:

$$f_q(x_m, m) = x_m\, e^{i m\theta}, \qquad f_k(x_n, n) = x_n\, e^{i n\theta}.$$

The real part of the complex inner product, $\mathrm{Re}[f_q(x_m, m)\, f_k(x_n, n)^*] = \mathrm{Re}[x_m x_n^*\, e^{i(m-n)\theta}]$, depends on the positions only through the relative offset $m-n$.
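
As a concrete check of the 2D case, the short sketch below (illustrative only; the frequency `theta` and the token values are arbitrary choices, not values from the paper) confirms numerically that the rotary-encoded score is unchanged when both positions are shifted by the same amount:

```python
# Illustrative 2D check (arbitrary values, not from the paper): the score depends
# only on the relative offset m - n, not on the absolute positions.
import numpy as np

theta = 0.3                       # assumed frequency for this single 2D subspace
x_m = complex(1.0, 2.0)           # query-token content, viewed as a complex number
x_n = complex(0.5, -1.0)          # key-token content

def score(m, n):
    """Re[(x_m e^{i m theta}) conj(x_n e^{i n theta})] = Re[x_m conj(x_n) e^{i (m-n) theta}]."""
    q = x_m * np.exp(1j * m * theta)
    k = x_n * np.exp(1j * n * theta)
    return (q * np.conj(k)).real

# Same offset m - n = 3 at different absolute positions gives the same score.
print(score(5, 2), score(40, 37))
```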

For $\mathbb{R}^d$, the space is decomposed into $d/2$ orthogonal 2-dimensional subspaces, each parameterized by its own frequency $\theta_i$. The full rotation is encoded with a block-diagonal matrix:

$$R_d(\Theta, m) = \mathrm{diag}\left[R(m\theta_1), \ldots, R(m\theta_{d/2})\right],$$

where each $2 \times 2$ block is $R(\phi) = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}$. The choice $\theta_i = 10000^{-2(i-1)/d}$ mirrors the original sinusoidal positional frequencies (Su et al., 2021).

The position-encoded queries and keys are then

$$q_m = R_d(\Theta, m)\,(W_q x_m), \qquad k_n = R_d(\Theta, n)\,(W_k x_n),$$

leading to attention scores computed as

$$q_m^\top k_n = (W_q x_m)^\top R_d(\Theta, n-m)\,(W_k x_n),$$

hence the attention score depends explicitly on the relative position $n-m$.
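
Because $R_d(\Theta, m)$ is block-diagonal, it never needs to be materialized as a full $d \times d$ matrix. The NumPy sketch below (a minimal illustration, assuming an even head dimension and an interleaved pairing of coordinates into 2D subspaces; it is not the reference implementation) applies the rotation to projected queries and keys and verifies that the resulting score depends only on the offset:

```python
# Minimal sketch of applying R_d(Theta, m) without building the full matrix.
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply R_d(Theta, pos) to a vector x of even dimension d."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)          # theta_i = 10000^{-2(i-1)/d}
    angles = pos * theta                                  # m * theta_i per 2D subspace
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                             # coordinates of each 2D subspace
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                       # 2x2 rotation per subspace
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 8
q_proj, k_proj = rng.normal(size=d), rng.normal(size=d)   # stand-ins for W_q x_m, W_k x_n

# Scores depend only on the relative offset: (m, n) = (7, 3) and (104, 100) agree.
s1 = rope_rotate(q_proj, 7) @ rope_rotate(k_proj, 3)
s2 = rope_rotate(q_proj, 104) @ rope_rotate(k_proj, 100)
print(np.isclose(s1, s2))                                 # True
```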

2. Relative Position Dependency in Self-Attention

Standard scaled dot-product attention with absolute position embeddings, which uses $Q_m = W_q(x_m + p_m)$ and $K_n = W_k(x_n + p_n)$, does not expose the relative position explicitly. In contrast, with RoPE,

$$Q_m = R_d(\Theta, m)\, W_q x_m, \qquad K_n = R_d(\Theta, n)\, W_k x_n,$$

so

$$\text{score}_{m,n} = (W_q x_m)^\top R_d(\Theta, n-m)\,(W_k x_n)\,/\sqrt{d}.$$

This form directly incorporates the relative rotation $R_d(\Theta, n-m)$.

Additionally, the magnitude of the attention score decays with increasing $|m-n|$. Specifically,

$$\text{score}_{m,n} = \mathrm{Re}\!\left[\sum_{i=1}^{d/2} h_i\, e^{i(m-n)\theta_i}\right],$$

where $h_i$ is determined by $x_m$ and $x_n$. An Abel summation argument bounds $\bigl|\sum_{j=1}^{i} e^{i(m-n)\theta_j}\bigr|$ and formalizes the decay property: under the chosen frequency schedule, this amplitude decreases as $|m-n|$ increases, biasing attention toward closer tokens (Su et al., 2021).
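
The following sketch (illustrative, assuming the standard $\theta_j = 10000^{-2(j-1)/d}$ schedule) computes the mean magnitude of these partial sums for a few offsets and shows it shrinking as the offset grows:

```python
# Illustrative check of long-term decay: the average magnitude of the partial sums
# |sum_{j<=i} e^{i (m-n) theta_j}| shrinks as the offset |m - n| grows.
import numpy as np

d = 128
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)

def mean_partial_sum_magnitude(offset):
    phases = np.exp(1j * offset * theta)          # e^{i (m-n) theta_j}
    partial = np.cumsum(phases)                   # S_i = sum_{j<=i} e^{i (m-n) theta_j}
    return np.abs(partial).mean()

for offset in [1, 8, 64, 256]:
    print(offset, round(mean_partial_sum_magnitude(offset), 2))
# The printed values decrease as the offset increases, matching the decay argument.
```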

3. Sequence-Length Flexibility and Compatibility with Linear Attention

RoPE does not rely on fixed-size lookup tables for position encoding; instead it computes the rotation angles $m\theta_i$ directly. This allows positions beyond the training set's maximum length to be encoded without modifying model parameters or retraining, yielding unbounded sequence-length support.
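
A minimal sketch of this point (the helper `build_rope_cache` is a hypothetical name, not an API from the paper or any library): the cosine/sine tables for any set of positions are generated on demand from $\theta_i$, so no stored table bounds the maximum position:

```python
# Illustrative sketch: angle tables are computed on the fly, so arbitrarily large
# positions need no new parameters or retraining.
import numpy as np

def build_rope_cache(positions, d, base=10000.0):
    """Return cos/sin tables of shape (len(positions), d // 2) for the given positions."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    angles = np.outer(positions, theta)           # m * theta_i, computed on demand
    return np.cos(angles), np.sin(angles)

# Works identically for positions seen in training and for much larger ones.
cos_a, sin_a = build_rope_cache(np.arange(0, 512), d=64)
cos_b, sin_b = build_rope_cache(np.arange(100_000, 100_004), d=64)   # far beyond training length
print(cos_a.shape, cos_b.shape)
```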

Rotary embeddings are compatible with kernel-based (linear) attention variants. For an attention kernel using a feature map $\phi$:

$$\text{Attention}_m = \frac{\sum_n \phi(Q_m)^\top \phi(K_n)\, V_n}{\sum_n \phi(Q_m)^\top \phi(K_n)},$$

the rotation is applied as

$$\bar{\phi}(Q_m) = R_d(\Theta, m)\, \phi(W_q x_m), \qquad \bar{\phi}(K_n) = R_d(\Theta, n)\, \phi(W_k x_n).$$

This preserves the $O(Nd)$ time and memory complexity of linear attention, maintaining linear scalability for long sequences (Su et al., 2021).
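
The sketch below (illustrative only; it assumes single-head, unbatched inputs and the positive feature map $\phi(x) = \mathrm{elu}(x) + 1$, and it leaves the normalizer unrotated so that it stays positive) shows how the rotation slots into a kernelized attention step while keeping the cost linear in the sequence length $N$:

```python
# Illustrative combination of RoPE with kernelized (linear) attention.
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply R_d(Theta, m) row-wise to x of shape (N, d), d even."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    angles = np.outer(positions, theta)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def phi(x):
    # A common positive feature map for linear attention: elu(x) + 1 (an assumption here).
    return np.where(x > 0, x + 1.0, np.exp(x))

def rotary_linear_attention(q, k, v):
    pos = np.arange(q.shape[0])
    qf = rope_rotate(phi(q), pos)                  # phi-bar(Q_m) = R_d(Theta, m) phi(q_m)
    kf = rope_rotate(phi(k), pos)                  # phi-bar(K_n) = R_d(Theta, n) phi(k_n)
    kv = kf.T @ v                                  # sum_n phi-bar(K_n) V_n^T, linear in N
    norm = phi(q) @ phi(k).sum(axis=0)             # normalizer left unrotated so it stays positive
    return (qf @ kv) / norm[:, None]

rng = np.random.default_rng(1)
N, d = 16, 8
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
print(rotary_linear_attention(q, k, v).shape)      # (16, 8)
```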

4. Theoretical Properties

Several theoretical aspects underpin RoPE:

  • 2D Uniqueness: Any self-attention kernel $g(x_m, x_n, m-n)$ that depends solely on content and relative position can be realized by appropriate rotations in each 2D subspace (see Appendix A of Su et al., 2021).
  • Long-Term Decay: The mean cross-term amplitude $\bigl|\sum_i e^{i(m-n)\theta_i}\bigr|$ decreases with $|m-n|$ for the selected $\theta_i$, enforcing a preference for short-range attention.
  • Norm Preservation and Stability: The rotation matrices $R_d(\Theta, m)$ are orthogonal, ensuring $\|Q_m\| = \|W_q x_m\|$ and preventing accumulation of numerical errors across layers (a quick numerical check is sketched after this list).
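
A quick numerical check of the norm-preservation property (illustrative only, using the standard frequency schedule):

```python
# Illustrative check: R_d(Theta, m) is orthogonal, so rotating a projected query
# leaves its Euclidean norm unchanged.
import numpy as np

d, m = 64, 1234
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
angles = m * theta
cos, sin = np.cos(angles), np.sin(angles)

x = np.random.default_rng(2).normal(size=d)       # stand-in for W_q x_m
rot = np.empty_like(x)
rot[0::2] = x[0::2] * cos - x[1::2] * sin
rot[1::2] = x[0::2] * sin + x[1::2] * cos

print(np.isclose(np.linalg.norm(x), np.linalg.norm(rot)))   # True
```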

5. Empirical Benchmarks

RoFormer models, employing RoPE, have been assessed on multiple language processing benchmarks:

| Task | Baseline | Metric | RoFormer Result |
|---|---|---|---|
| Machine Translation (WMT’14 En→De) | Transformer | BLEU | 27.5 (vs. 27.3 baseline) |
| BERT Pre-training (MLM loss) | BERT-base | Training loss | Faster convergence |
| GLUE (MRPC, STS-B, QQP) | BERT-base | F1 / ρ | Outperforms on 3 of 6 tasks |
| Performer (Enwik8 character-level LM) | Performer | LM loss | Faster convergence, lower LM loss |
| Long-Text Classification (CAIL2019-SCM) | BERT / WoBERT | Accuracy | RoFormer-512: 68.29%; RoFormer-1024: 69.79% |

RoFormer provided consistent empirical gains in long-document classification, converged faster than sinusoidal alternatives in pre-training, and achieved higher or comparable accuracy across various evaluation settings. Specifically, in Chinese long-text benchmarks, RoFormer-1024 improved accuracy by 1.5 percentage points over WoBERT-512 (Su et al., 2021).

6. Implementation and Integration

RoPE is integrated in popular frameworks such as Huggingface Transformers, leveraging its plug-and-play nature in both quadratic and linear attention settings. The required model changes are limited to replacing the original query and key projections with their rotary-embedded counterparts; no architectural or memory cost is incurred beyond this embedding transformation (Su et al., 2021).
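
As an illustration of how small the change is, the self-contained sketch below (hypothetical helper names, not the Hugging Face API) rotates the projected queries and keys right after the linear projections and leaves the rest of a standard attention layer untouched:

```python
# Illustrative integration point: the only RoPE-specific step is rotating Q and K
# after projection; values and the softmax attention itself are unchanged.
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate rows of x (shape (N, d), d even) by their positions."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    ang = np.outer(positions, theta)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def attention_with_rope(x, w_q, w_k, w_v):
    n = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    q, k = apply_rope(q, np.arange(n)), apply_rope(k, np.arange(n))   # the only RoPE-specific step
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(3)
d = 8
x = rng.normal(size=(5, d))
out = attention_with_rope(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape)   # (5, 8)
```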

References

  • Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.