
Relative Position Encoding

Updated 25 February 2026
  • Relative Position Encoding is a method that defines token relationships by their distances rather than fixed indices, ensuring translation and shift invariance.
  • It is widely used across transformer models in language, vision, audio, and graphs to enhance generalization and reduce parameter overhead.
  • Empirical results demonstrate its benefits, such as improved performance metrics (e.g., +1.5% Top-1 on ImageNet) and efficient handling of long sequences.

Relative Position Encoding (RPE) is a class of positional encoding schemes for sequence, image, graph, and multi-modal transformer models in which positional information is injected not as a function of token indices themselves, but as a function of the relative distance or relation between pairs of tokens, nodes, or elements. In contrast to absolute position encoding—which ties model representations to fixed coordinate systems—RPE enables translation-invariant or shift-invariant modeling, better generalization to longer or differently-structured inputs, and more direct encoding of local and non-local relationships. RPE schemes have become foundational in LLMs, vision transformers, audio, graph, and multi-view/multi-modal transformers.

1. Principles and Mathematical Foundations

Classic transformer models, such as in Vaswani et al. ("Attention Is All You Need"), inject an absolute position embedding p_i into each input token x_i, yielding representations that are sensitive to the single position index i. In contrast, relative position encoding injects pairwise information r_{i-j} (or variants thereof) into the attention mechanism, so that attention weights, or their parameterizations, become functions of the distance i-j (or a general relational operator) between tokens or elements.

A canonical mathematical scheme for 1D sequences is the so-called "key-relative bias" variant:

\mathrm{Attention}(q_i, k_j) = \frac{1}{\sqrt{d}} \big[ q_i^\top (k_j + r_{i-j}) \big]

with r as a learned or structured table of bias vectors indexed by the offset i-j (Pham et al., 2020, Huang et al., 2020, Chen, 2021).
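As a concrete illustration, the key-relative variant above can be sketched in NumPy. The function name, shapes, and the clipping window `max_dist` are illustrative, not taken from any cited implementation:

```python
import numpy as np

def key_relative_attention(Q, K, rel_emb, max_dist):
    """Sketch of the key-relative bias variant: q_i^T (k_j + r_{i-j}) / sqrt(d).

    Q, K: (L, d) query/key matrices.
    rel_emb: (2*max_dist + 1, d) table of offset embeddings for offsets
             -max_dist .. +max_dist.
    """
    L, d = Q.shape
    # Pairwise offsets i - j, clipped to the table range and shifted to indices.
    offsets = np.clip(np.arange(L)[:, None] - np.arange(L)[None, :],
                      -max_dist, max_dist) + max_dist
    R = rel_emb[offsets]                                  # (L, L, d): r_{i-j} per pair
    logits = (Q @ K.T + np.einsum('id,ijd->ij', Q, R)) / np.sqrt(d)
    # Softmax over keys.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)
```

Note that the offset table has O(2·max_dist + 1) rows regardless of sequence length, which is exactly the parameter-efficiency argument made below.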

Alternative forms include direct bias addition to logits ("attention-bias"):

\mathrm{Attention}(q_i, k_j) = \frac{1}{\sqrt{d}} \left( q_i^\top k_j + b_{i-j} \right)

with b a learned or structured bias indexed by offset i-j (Lv et al., 28 Jan 2025, Foumani et al., 2023, Hao et al., 2024).
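The attention-bias form is even simpler: a scalar table added to the logits, constant along each diagonal of the attention matrix. A minimal sketch (names and the clipping window are assumptions for illustration):

```python
import numpy as np

def attention_bias_logits(Q, K, bias_table, max_dist):
    """Sketch of the attention-bias variant: (q_i^T k_j + b_{i-j}) / sqrt(d).

    bias_table: (2*max_dist + 1,) scalar biases indexed by clipped offset.
    """
    L, d = Q.shape
    offsets = np.clip(np.arange(L)[:, None] - np.arange(L)[None, :],
                      -max_dist, max_dist) + max_dist
    # The bias term depends only on i - j, so it is constant along diagonals.
    return (Q @ K.T + bias_table[offsets]) / np.sqrt(d)
```

With zeroed content, the logits reduce to the pure positional bias, making the shift-invariance of this scheme easy to verify.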

Empirically, relative encodings:

  • Generalize out-of-domain to longer sequence lengths (since they store only offset-wise, not absolute, relations).
  • Require fewer parameters to capture spatial or sequential relations (O(L) or O(2L-1)) versus explicit absolute tables over potentially unbounded position indices.
  • Support translation invariance and topology-awareness (graph, 2D, or 3D geometry), and can reflect locality or directionality.

2. Core Methods Across Modalities

Sequences and Language

Shaw et al. (2018)-style RPE adds learned embedding vectors r_{i-j} or biases b_{i-j} to keys or directly to attention logits, with distance clipping for parameter efficiency (Pham et al., 2020, Huang et al., 2020). Enhanced forms generalize the bias to terms that interact with both queries and keys ("contextual mode"):

e_{ij} = (Q_i K_j) + (Q_i r_{i-j}) + (K_j r_{i-j})

(Huang et al., 2020).
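The contextual mode can be sketched by adding both query-relative and key-relative dot products to the content term; this is an illustrative NumPy rendering, not the reference implementation:

```python
import numpy as np

def contextual_rpe_logits(Q, K, rel_emb, max_dist):
    """Sketch of the contextual mode: e_ij = Q_i·K_j + Q_i·r_{i-j} + K_j·r_{i-j}."""
    L, d = Q.shape
    offsets = np.clip(np.arange(L)[:, None] - np.arange(L)[None, :],
                      -max_dist, max_dist) + max_dist
    R = rel_emb[offsets]                      # (L, L, d)
    return (Q @ K.T
            + np.einsum('id,ijd->ij', Q, R)   # query-relative term
            + np.einsum('jd,ijd->ij', K, R))  # key-relative term
```

Setting the relative table to zero recovers plain content attention, which makes the two positional terms easy to ablate.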

Rotary Position Embedding (RoPE) represents positions as rotations in the complex/real plane or on a Lie group, such that the rotations applied to (q_i, k_j) leave their inner product dependent only on the difference j-i; generalizations to multidimensional spaces or noncommutative rotations are possible (Ostmeier et al., 2024, Chen, 2021).
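A minimal RoPE sketch makes the relative property concrete: each pair of feature dimensions is rotated by a position-scaled angle, and the dot product of two rotated vectors depends only on the position difference. This uses the half-split pairing convention; the frequency base 10000 follows the common convention but is an assumption here:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate 2-D feature planes of x by position-dependent angles (RoPE sketch).

    x: (d,) vector with even d; pos: integer position.
    """
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per feature plane
    theta = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])
```

Because each plane is rotated by pos·freq, the inner product rope(q, i)·rope(k, j) equals q·R((j-i)·freq)k plane-by-plane, i.e. it is a function of j-i alone.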

Efficient RPE for Linear Attention: For kernel-based or Performer attention (O(L) time), naive pairwise RPE breaks tractability. Solutions such as PermuteFormer encode relative position by position-dependent linear maps (permutations, scalings), preserving relative invariance and linear complexity (Chen, 2021, Qin et al., 2023).
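Why position-dependent permutations preserve relative structure can be shown in a few lines: applying a fixed permutation i times to query features and j times to key features yields an inner product that depends only on j-i, since permutation matrices are orthogonal. This is a toy demonstration of the principle, not the actual PermuteFormer implementation:

```python
import numpy as np

def permute_features(x, perm, t):
    """Apply the feature permutation `perm` to x, t times (position-dependent map)."""
    out = x.copy()
    for _ in range(t):
        out = out[perm]
    return out
```

If P is the permutation matrix, (P^i u)·(P^j v) = u·P^(j-i) v, so only the offset j-i matters, and the per-token map costs O(d), keeping linear attention linear.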

Vision (2D, 3D, Multimodal)

2D and 3D Scheme Extension:

  • Directional and cross-product RPEs separate x and y (and z) offsets, with contextual bucketing by (Δx, Δy) or their products; this enables explicit modeling of horizontal, vertical, or spatially composite relations (Wu et al., 2021, Shen et al., 2023).
  • Contextual/semantic-aware RPE (e.g., SaPE²) leverages gates or affinity scores informed by content, yielding content-aware RPE that tracks semantic similarity rather than just geometric distances (Chen et al., 14 May 2025).
  • Affine-invariant RPE (e.g., KP-RPE) adapts bias fields based on keypoint anchors, robustly encoding relationships under affine image transformations (Kim et al., 2024).

Multi-view, Camera, and Multi-modal Relative Encoding:

  • Camera-aware RPEs (PRoPE) encode the full projective geometry (intrinsics and extrinsics) between viewpoints in the attention bias, supporting robust generalization under camera/scene variation (Li et al., 14 Jul 2025).

Graphs and Structured Domains

GRPE introduces relative position encodings on graphs using topological distances (shortest path) and edge types, integrated as content-aware vector dot products between node queries/keys and tabled relation embeddings. This scheme avoids graph linearization and preserves full topology-awareness (Park et al., 2022).
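The topological-distance ingredient can be sketched as BFS shortest-path hop counts feeding a relation-embedding table; the function names and the content-aware bias form below are a simplified, GRPE-flavored illustration, not the paper's exact formulation:

```python
import numpy as np
from collections import deque

def shortest_path_distances(adj):
    """All-pairs shortest-path hop counts via BFS on an unweighted graph.

    adj: (N, N) 0/1 adjacency matrix. Unreachable pairs are left at -1.
    """
    N = adj.shape[0]
    dist = -np.ones((N, N), dtype=int)
    for s in range(N):
        dist[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in np.nonzero(adj[u])[0]:
                if dist[s, v] < 0:
                    dist[s, v] = dist[s, u] + 1
                    q.append(v)
    return dist

def grpe_style_bias(Q, spd, rel_table, max_hops):
    """Content-aware topological bias: b_ij = Q_i · e_{spd(i,j)} (sketch).

    Note: this toy clips unreachable (-1) pairs into bucket 0; a real
    implementation would reserve a separate 'unreachable' bucket.
    """
    idx = np.clip(spd, 0, max_hops)
    return np.einsum('id,ijd->ij', Q, rel_table[idx])
```

Because the bias is indexed by hop distance rather than node identity, no graph linearization is needed and the encoding is invariant to node relabeling.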

Temporal and Spatio-Temporal

Video and Time Series RPE: For videos and temporal signals, RPE is extended to temporal, spatial, or spatio-temporal axes, often as low-rank or parameter-efficient dictionaries indexed by relative offsets. These are efficiently combined in grouped or factorized MLP or sparse attention blocks (Hao et al., 2024, Foumani et al., 2023).

3. Theoretical Properties and Algorithmic Trade-offs

Relative position encodings are fundamentally defined by their shift, translation, and topology invariance:

  • Shift-invariance: Attention scores depend only on i-j, so models trained on short sequences can immediately extrapolate.
  • Parameter efficiency: Only O(L) parameters or fewer, especially with bucketing or tying, even in high-dimensional settings.
  • Algorithmic overhead: Standard RPE incurs O(L^2) cost, breaking the linear-time properties of efficient transformer variants unless specially designed (e.g., via unitary transforms, permutations, or kernel-compatible RPE) (Chen, 2021, Qin et al., 2023).

Empirical findings indicate that:

  • Contextual and cross-content RPEs outperform simple bias variants in vision/classification.
  • The inductive bias introduced (e.g., by directional or semantic bucketing) improves structure-sensitive domains (images, graphs, code).
  • RPE can fully replace absolute position encoding in many domains with no loss and frequently with measurable gain.

A summary of variants and their compatibility/complexity properties:

| Scheme | Complexity | Parameter Count | Generalization |
|---|---|---|---|
| Shaw et al. 1D/2D | O(L^2 d) | O(2k d) or O(k^2 d) | Good (distance) |
| PermuteFormer | O(L m^2) | negligible extra | Full (linear time) |
| LRPE/Unitary (linear attn) | O(L d^2) | O(d^2) per pos | As above |
| Contextual (product/cross) | O(n k d) | O(k d) | Enhanced (semantics) |
| Semantic-aware (SaPE²) | O(N^2) | O(M+1) | Semantic + shift-inv |
| Graph (GRPE) | O(N^2) | O(L d) | Full (graph) |

4. Domain-Specific Adaptations and Innovations

  • Speech: RPEs built from bidirectional sinusoids adapt seamlessly to much longer and more variable input with negligible extra cost, outperforming absolute encoding on multi-hour speech recognition and translation (Pham et al., 2020, Likhomanenko et al., 2021).
  • 3D Object Detection: Vertex-based RPE encodes per-query, per-vertex offsets in canonical boxes, providing differentiable, locality-aware priors that outperform both standard RPE and hard box-masks in DETR-style architectures (Shen et al., 2023).
  • Multi-modal Models: Circle-RoPE addresses cross-modal bias in large vision-LLMs, mapping image token positions onto a geometric structure orthogonal to text, reducing artificial alignment and improving spatial robustness (Wang et al., 22 May 2025).
  • Affine Robustness in Vision: KP-RPE achieves alignment invariance by warping positional bias fields conditional on detected keypoints, preserving spatial priors under large transformations (Kim et al., 2024).
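The bidirectional sinusoidal tables used in the speech setting above can be sketched as a fixed function over signed offsets; being parameter-free, the table extends to arbitrarily long inputs at negligible cost. The frequency base and interleaving convention below follow the standard sinusoidal recipe and are assumptions for illustration:

```python
import numpy as np

def sinusoidal_relative_embeddings(max_offset, d, base=10000.0):
    """Fixed sinusoidal embeddings for offsets -max_offset .. +max_offset.

    Returns a (2*max_offset + 1, d) table; row max_offset is offset 0.
    """
    offsets = np.arange(-max_offset, max_offset + 1)[:, None]  # (2L+1, 1)
    inv_freq = base ** (-np.arange(0, d, 2) / d)               # (d/2,)
    angles = offsets * inv_freq                                # (2L+1, d/2)
    emb = np.zeros((offsets.shape[0], d))
    emb[:, 0::2] = np.sin(angles)   # even dims: sine
    emb[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return emb
```

To handle a longer input, the table is simply regenerated with a larger max_offset; no parameters change, which is why these embeddings adapt to multi-hour audio without retraining.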

5. Empirical Impact, Ablation Studies, and Limitations

RPE forms consistently deliver gains across vision (up to +1.5% Top-1 on ImageNet in DeiT/ViT), speech (up to 7% relative WER reduction), code-editing (1–3% absolute improvement in patch accuracy), and time series (best average rank across 30+ multivariate datasets). Ablations confirm:

  • Contextual and multi-branch (Q, K, V) injections are superior to scalar or index-only biases (Wu et al., 2021, Huang et al., 2020).
  • Composite directional buckets and semantic gate mechanisms enhance translation and scale equivariance.
  • For efficient attention (Performer, linear transformers), only specially designed RPEs (e.g., PermuteFormer, linearized unitary transforms) can maintain O(L) complexity.

Limitations include the increased O(N^2) cost of naive high-dimensional or semantic RPE, the inability of permutation-based methods to encode relations beyond permutation order, and sensitivity to domain-specific preprocessing (e.g., keypoint detection, camera pose consistency). Open challenges include further reducing memory/computational overhead for very large contexts and enabling differentiable learning of permutation or semantic structure in efficient RPEs (Chen, 2021, Chen et al., 14 May 2025, Foumani et al., 2023).

6. Future Directions and Open Challenges

Research directions for RPE include reducing memory and computational overhead for very large contexts, enabling differentiable learning of permutation or semantic structure in efficient attention, and strengthening geometry-aware and cross-modal relative encodings.


Relative position encoding has thus evolved into a foundational, extensible mechanism for geometric, relational, or context-invariant modeling across modern transformer architectures.
