RoFormer: Enhanced Transformer with Rotary Position Embedding

Updated 20 November 2025
  • The paper introduces rotary position embedding (RoPE), which replaces additive positional encodings with position-dependent rotations in 2D subspaces to jointly encode absolute and relative positions.
  • RoFormer achieves sequence-length flexibility by computing rotations on the fly, eliminating fixed lookup tables and integrating naturally with linear attention mechanisms.
  • Empirical benchmarks demonstrate that RoFormer converges faster and improves performance on tasks like machine translation and long-text classification compared to traditional Transformer models.

RoFormer is an enhanced Transformer architecture that incorporates rotary position embedding (RoPE) to encode absolute and relative position information jointly and efficiently within the self-attention mechanism. RoPE replaces additive or fixed positional encodings with position-dependent rotations applied in each 2-dimensional coordinate subspace, providing a framework that is both theoretically well-justified and practically flexible for a broad class of Transformer-based models (Su et al., 2021).

1. Derivation of Rotary Position Embedding

The objective of RoPE is to define position-aware queries $q_m = f_q(x_m, m)$ and keys $k_n = f_k(x_n, n)$ such that their inner product is a function solely of the content ($x_m, x_n$) and the relative position ($m-n$), i.e., $\langle f_q(x_m, m), f_k(x_n, n)\rangle = g(x_m, x_n, m-n)$.

In two dimensions, this is achieved by identifying $\mathbb{R}^2 \cong \mathbb{C}$ and representing the embedding via rotation:

$$f_q(x_m, m) = x_m\, e^{i m\theta}, \qquad f_k(x_n, n) = x_n\, e^{i n\theta}.$$

The real part of the complex inner product, $\mathrm{Re}[f_q(x_m, m)\, f_k(x_n, n)^*] = \mathrm{Re}[x_m x_n^*\, e^{i(m-n)\theta}]$, depends on the positions only through the relative offset $m-n$.
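
As a concrete check of the 2D case, the short sketch below (illustrative only; the frequency `theta` and the token values are arbitrary choices, not values from the paper) confirms numerically that the rotary-encoded score is unchanged when both positions are shifted by the same amount:

```python
# Illustrative 2D check (arbitrary values, not from the paper): the score depends
# only on the relative offset m - n, not on the absolute positions.
import numpy as np

theta = 0.3                       # assumed frequency for this single 2D subspace
x_m = complex(1.0, 2.0)           # query-token content, viewed as a complex number
x_n = complex(0.5, -1.0)          # key-token content

def score(m, n):
    """Re[(x_m e^{i m theta}) conj(x_n e^{i n theta})] = Re[x_m conj(x_n) e^{i (m-n) theta}]."""
    q = x_m * np.exp(1j * m * theta)
    k = x_n * np.exp(1j * n * theta)
    return (q * np.conj(k)).real

# Same offset m - n = 3 at different absolute positions gives the same score.
print(score(5, 2), score(40, 37))
```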

For $\mathbb{R}^d$, the space is decomposed into $d/2$ orthogonal 2-dimensional subspaces, each parameterized by its own frequency $\theta_i$. The full rotation is encoded with a block-diagonal matrix:

$$R_d(\Theta, m) = \mathrm{diag}\left[R(m\theta_1), \ldots, R(m\theta_{d/2})\right],$$

where each $2 \times 2$ block is $R(\phi) = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}$. The choice $\theta_i = 10000^{-2(i-1)/d}$ mirrors the original sinusoidal positional frequencies (Su et al., 2021).

The position-encoded queries and keys are then

$$q_m = R_d(\Theta, m)\,(W_q x_m), \qquad k_n = R_d(\Theta, n)\,(W_k x_n),$$

leading to attention scores computed as

$$q_m^\top k_n = (W_q x_m)^\top R_d(\Theta, n-m)\,(W_k x_n),$$

hence the attention score depends explicitly on the relative position $n-m$.
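
Because $R_d(\Theta, m)$ is block-diagonal, it never needs to be materialized as a full $d \times d$ matrix. The NumPy sketch below (a minimal illustration, assuming an even head dimension and an interleaved pairing of coordinates into 2D subspaces; it is not the reference implementation) applies the rotation to projected queries and keys and verifies that the resulting score depends only on the offset:

```python
# Minimal sketch of applying R_d(Theta, m) without building the full matrix.
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply R_d(Theta, pos) to a vector x of even dimension d."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)          # theta_i = 10000^{-2(i-1)/d}
    angles = pos * theta                                  # m * theta_i per 2D subspace
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                             # coordinates of each 2D subspace
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                       # 2x2 rotation per subspace
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 8
q_proj, k_proj = rng.normal(size=d), rng.normal(size=d)   # stand-ins for W_q x_m, W_k x_n

# Scores depend only on the relative offset: (m, n) = (7, 3) and (104, 100) agree.
s1 = rope_rotate(q_proj, 7) @ rope_rotate(k_proj, 3)
s2 = rope_rotate(q_proj, 104) @ rope_rotate(k_proj, 100)
print(np.isclose(s1, s2))                                 # True
```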

2. Relative Position Dependency in Self-Attention

Standard scaled dot-product attention with absolute position embeddings, which uses $Q_m = W_q(x_m + p_m)$ and $K_n = W_k(x_n + p_n)$, does not expose the relative position explicitly. In contrast, with RoPE,

$$Q_m = R_d(\Theta, m)\, W_q x_m, \qquad K_n = R_d(\Theta, n)\, W_k x_n,$$

so

$$\text{score}_{m,n} = (W_q x_m)^\top R_d(\Theta, n-m)\,(W_k x_n)\,/\sqrt{d}.$$

This form directly incorporates the relative rotation $R_d(\Theta, n-m)$.

Additionally, the magnitude of the attention score decays with increasing $|m-n|$. Specifically,

$$\text{score}_{m,n} = \mathrm{Re}\!\left[\sum_{i=1}^{d/2} h_i\, e^{i(m-n)\theta_i}\right],$$

where $h_i$ is determined by $x_m$ and $x_n$. An Abel summation argument bounds $\bigl|\sum_{j=1}^{i} e^{i(m-n)\theta_j}\bigr|$ and formalizes the decay property: under the chosen frequency schedule, this amplitude decreases as $|m-n|$ increases, biasing attention toward closer tokens (Su et al., 2021).
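
The following sketch (illustrative, assuming the standard $\theta_j = 10000^{-2(j-1)/d}$ schedule) computes the mean magnitude of these partial sums for a few offsets and shows it shrinking as the offset grows:

```python
# Illustrative check of long-term decay: the average magnitude of the partial sums
# |sum_{j<=i} e^{i (m-n) theta_j}| shrinks as the offset |m - n| grows.
import numpy as np

d = 128
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)

def mean_partial_sum_magnitude(offset):
    phases = np.exp(1j * offset * theta)          # e^{i (m-n) theta_j}
    partial = np.cumsum(phases)                   # S_i = sum_{j<=i} e^{i (m-n) theta_j}
    return np.abs(partial).mean()

for offset in [1, 8, 64, 256]:
    print(offset, round(mean_partial_sum_magnitude(offset), 2))
# The printed values decrease as the offset increases, matching the decay argument.
```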

3. Sequence-Length Flexibility and Compatibility with Linear Attention

RoPE does not rely on fixed-size lookup tables for position encoding; instead it computes the rotation angles $m\theta_i$ directly. This allows positions beyond the training set's maximum length to be encoded without modifying model parameters or retraining, yielding unbounded sequence-length support.
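
A minimal sketch of this point (the helper `build_rope_cache` is a hypothetical name, not an API from the paper or any library): the cosine/sine tables for any set of positions are generated on demand from $\theta_i$, so no stored table bounds the maximum position:

```python
# Illustrative sketch: angle tables are computed on the fly, so arbitrarily large
# positions need no new parameters or retraining.
import numpy as np

def build_rope_cache(positions, d, base=10000.0):
    """Return cos/sin tables of shape (len(positions), d // 2) for the given positions."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    angles = np.outer(positions, theta)           # m * theta_i, computed on demand
    return np.cos(angles), np.sin(angles)

# Works identically for positions seen in training and for much larger ones.
cos_a, sin_a = build_rope_cache(np.arange(0, 512), d=64)
cos_b, sin_b = build_rope_cache(np.arange(100_000, 100_004), d=64)   # far beyond training length
print(cos_a.shape, cos_b.shape)
```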

Rotary embeddings are compatible with kernel-based (linear) attention variants. For an attention kernel using a feature map $\phi$:

$$\text{Attention}_m = \frac{\sum_n \phi(Q_m)^\top \phi(K_n)\, V_n}{\sum_n \phi(Q_m)^\top \phi(K_n)},$$

the rotation is applied as

$$\bar{\phi}(Q_m) = R_d(\Theta, m)\, \phi(W_q x_m), \qquad \bar{\phi}(K_n) = R_d(\Theta, n)\, \phi(W_k x_n).$$

This preserves the $O(Nd)$ time and memory complexity of linear attention, maintaining linear scalability for long sequences (Su et al., 2021).
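
The sketch below (illustrative only; it assumes single-head, unbatched inputs and the positive feature map $\phi(x) = \mathrm{elu}(x) + 1$, and it leaves the normalizer unrotated so that it stays positive) shows how the rotation slots into a kernelized attention step while keeping the cost linear in the sequence length $N$:

```python
# Illustrative combination of RoPE with kernelized (linear) attention.
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply R_d(Theta, m) row-wise to x of shape (N, d), d even."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    angles = np.outer(positions, theta)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def phi(x):
    # A common positive feature map for linear attention: elu(x) + 1 (an assumption here).
    return np.where(x > 0, x + 1.0, np.exp(x))

def rotary_linear_attention(q, k, v):
    pos = np.arange(q.shape[0])
    qf = rope_rotate(phi(q), pos)                  # phi-bar(Q_m) = R_d(Theta, m) phi(q_m)
    kf = rope_rotate(phi(k), pos)                  # phi-bar(K_n) = R_d(Theta, n) phi(k_n)
    kv = kf.T @ v                                  # sum_n phi-bar(K_n) V_n^T, linear in N
    norm = phi(q) @ phi(k).sum(axis=0)             # normalizer left unrotated so it stays positive
    return (qf @ kv) / norm[:, None]

rng = np.random.default_rng(1)
N, d = 16, 8
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
print(rotary_linear_attention(q, k, v).shape)      # (16, 8)
```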

4. Theoretical Properties

Several theoretical aspects underpin RoPE:

  • 2D Uniqueness: Any self-attention kernel $g(x_m, x_n, m-n)$ that depends solely on content and relative position can be realized by appropriate rotations in each 2D subspace (see Appendix A of Su et al., 2021).
  • Long-Term Decay: The mean cross-term amplitude $\bigl|\sum_i e^{i(m-n)\theta_i}\bigr|$ decreases with $|m-n|$ for the selected $\theta_i$, enforcing a preference for short-range attention.
  • Norm Preservation and Stability: The rotation matrices $R_d(\Theta, m)$ are orthogonal, ensuring $\|Q_m\| = \|W_q x_m\|$ and preventing accumulation of numerical errors across layers (a quick numerical check is sketched after this list).
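
A quick numerical check of the norm-preservation property (illustrative only, using the standard frequency schedule):

```python
# Illustrative check: R_d(Theta, m) is orthogonal, so rotating a projected query
# leaves its Euclidean norm unchanged.
import numpy as np

d, m = 64, 1234
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
angles = m * theta
cos, sin = np.cos(angles), np.sin(angles)

x = np.random.default_rng(2).normal(size=d)       # stand-in for W_q x_m
rot = np.empty_like(x)
rot[0::2] = x[0::2] * cos - x[1::2] * sin
rot[1::2] = x[0::2] * sin + x[1::2] * cos

print(np.isclose(np.linalg.norm(x), np.linalg.norm(rot)))   # True
```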

5. Empirical Benchmarks

RoFormer models, employing RoPE, have been assessed on multiple language processing benchmarks:

| Task | Baseline | Metric | RoFormer Result |
|---|---|---|---|
| Machine Translation (WMT’14 En→De) | Transformer | BLEU | 27.5 (vs. 27.3 baseline) |
| BERT Pre-training (MLM loss) | BERT-base | Training loss | Faster convergence |
| GLUE (MRPC, STS-B, QQP) | BERT-base | F1 / ρ | Outperforms on 3 of 6 tasks |
| Performer (Enwik8 character-level LM) | Performer | LM loss | Faster convergence, lower LM loss |
| Long-Text Classification (CAIL2019-SCM) | BERT / WoBERT | Accuracy | RoFormer-512: 68.29%; RoFormer-1024: 69.79% |

RoFormer provided consistent empirical gains in long-document classification, converged faster than sinusoidal alternatives in pre-training, and achieved higher or comparable accuracy across various evaluation settings. Specifically, in Chinese long-text benchmarks, RoFormer-1024 improved accuracy by 1.5 percentage points over WoBERT-512 (Su et al., 2021).

6. Implementation and Integration

RoPE is integrated in popular frameworks such as Huggingface Transformers, leveraging its plug-and-play nature in both quadratic and linear attention settings. The required model changes are limited to replacing the original query and key projections with their rotary-embedded counterparts; no architectural or memory cost is incurred beyond this embedding transformation (Su et al., 2021).
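
As an illustration of how small the change is, the self-contained sketch below (hypothetical helper names, not the Hugging Face API) rotates the projected queries and keys right after the linear projections and leaves the rest of a standard attention layer untouched:

```python
# Illustrative integration point: the only RoPE-specific step is rotating Q and K
# after projection; values and the softmax attention itself are unchanged.
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate rows of x (shape (N, d), d even) by their positions."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    ang = np.outer(positions, theta)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def attention_with_rope(x, w_q, w_k, w_v):
    n = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    q, k = apply_rope(q, np.arange(n)), apply_rope(k, np.arange(n))   # the only RoPE-specific step
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(3)
d = 8
x = rng.normal(size=(5, d))
out = attention_with_rope(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape)   # (5, 8)
```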

References

  • Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.