Relative Position Representation (RPE)
- Relative Position Representation (RPE) is a set of techniques that encode pairwise relational information in attention models, making them aware of order and locality in the data.
- RPE methods are applied across NLP, vision, graphs, and video by integrating techniques like bias tables, rotary encodings, and multi-scale representations.
- Empirical results show RPE improves benchmarks in language and vision tasks, enhances long-context generalization, and provides efficient, scalable model design.
Relative Position Representation (RPE) is a family of techniques that augment neural network models—predominantly Transformers—with explicit information about the geometric, temporal, or logical relationship between pairs of data elements, enabling order and locality awareness while maintaining translation or permutation equivariance. RPE has seen broad adoption in natural language processing, computer vision, graph learning, molecular modeling, video understanding, structured data tasks, and emerging modalities such as spiking neural networks, due to its flexibility, efficiency, and ability to encode domain-specific relational inductive biases.
1. Mathematical Foundations of Relative Position Representations
RPE augments the standard self-attention mechanism by modifying or biasing the attention computation based on the pairwise relationship (typically signed distance or spatial offset) between input elements. In the canonical formulation by Shaw et al. (Shaw et al., 2018), for a sequence of hidden vectors $x_1, \ldots, x_n$, the $i$–$j$ attention score becomes

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a^K_{ij})^\top}{\sqrt{d_z}},$$

where $a^K_{ij} = w^K_{\mathrm{clip}(j-i,\,k)}$ is drawn from a learned embedding table indexed by relative distance (clipped to $[-k, k]$ for $|j-i| > k$). The value aggregation can analogously receive a relative embedding, leading to the explicit parameterization

$$z_i = \sum_j \alpha_{ij}\,(x_j W^V + a^V_{ij}).$$

This approach generalizes easily to arbitrary discrete or continuous relational features, including higher-dimensional spatial offsets, keypoint-based distances, graph metrics (e.g. shortest-path, resistance distance), temporal lags, or domain-specific relations.
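To make the parameterization concrete, here is a minimal single-head PyTorch sketch of the Shaw-style formulation; the function name, tensor shapes, and the clipping window `k` are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def shaw_relative_attention(x, Wq, Wk, Wv, rel_k, rel_v, k=8):
    """Single-head self-attention with Shaw-style relative embeddings.
    x: (n, d) hidden vectors; rel_k, rel_v: (2k+1, d) learned offset tables."""
    n, d = x.shape
    q, key, val = x @ Wq, x @ Wk, x @ Wv
    # signed relative distance j - i, clipped to [-k, k], shifted to table index [0, 2k]
    idx = torch.arange(n)
    rel = (idx[None, :] - idx[:, None]).clamp(-k, k) + k        # (n, n)
    a_k, a_v = rel_k[rel], rel_v[rel]                           # (n, n, d)
    # e_ij = q_i . (k_j + a^K_ij) / sqrt(d)
    logits = (q @ key.T + torch.einsum('id,ijd->ij', q, a_k)) / d ** 0.5
    alpha = F.softmax(logits, dim=-1)
    # z_i = sum_j alpha_ij (v_j + a^V_ij)
    return alpha @ val + torch.einsum('ij,ijd->id', alpha, a_v)
```

In multi-head attention the offset tables are often shared across heads to keep the parameter count small.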
From an operator perspective, RPE induces an additive or multiplicative bias into the attention logit, with various implementations:
- Additive bias (most common): $A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b_{r(i,j)}$, with a learned scalar $b$ per relative offset bucket $r(i,j)$
- Multiplicative and contextual RPE (e.g., iRPE for ViTs), where the bias interacts with content, e.g. $b_{ij} = q_i^\top r_{ij}$ with $r_{ij}$ a bucketed relative embedding (Wu et al., 2021)
- Toeplitz or convolutional structure in 1D RPE (the bias depends only on the offset $i-j$), allowing efficient implementation via FFT (Luo et al., 2021); see the sketch below
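Because a 1D relative bias depends only on the offset $i-j$, the bias matrix is Toeplitz. The following sketch is an illustration of the general trick, not the specific method of Luo et al.: a Toeplitz matrix–vector product is computed in $O(n \log n)$ by circulant embedding and the FFT convolution theorem.

```python
import torch

def toeplitz_matvec_fft(c, v):
    """Compute T @ v in O(n log n), where T[i, j] = c[(i - j) + n - 1] is Toeplitz
    and c holds the 2n-1 relative biases for offsets -(n-1) .. n-1."""
    n = v.shape[0]
    # embed T in a 2n x 2n circulant matrix whose first column is
    # [t_0, t_1, ..., t_{n-1}, 0, t_{-(n-1)}, ..., t_{-1}]
    col = torch.cat([c[n - 1:], torch.zeros(1), c[:n - 1]])
    spec = torch.fft.fft(col) * torch.fft.fft(torch.cat([v, torch.zeros(n)]))
    return torch.fft.ifft(spec).real[:n]
```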
A critical distinction is between absolute (index-based) and relative (offset-based) encoding. Absolute encodings inject position individually; RPE conditions attention directly on pairwise relations—yielding natural shift invariance (Shaw et al., 2018, Wu et al., 2021).
2. Taxonomy of RPE Methods Across Modalities
RPE methods are specialized according to data modality, problem structure, and inductive biases:
Text and Sequence Data:
- Classical embedding tables indexed by the clipped relative offset $j-i$ (Shaw et al., 2018)
- Toeplitz or low-rank representations for efficient scaling (Luo et al., 2021)
- Spectral (Fourier) RPEs (learned in frequency domain) for linearized attention (Choromanski et al., 2023)
- Kernelized or stochastic RPE (Liutkus et al., 2021)
Vision:
- 2D RPE for ViTs: bucketed 2D coordinate bins or product/cross of axis offsets (Wu et al., 2021); a bucketing sketch follows this list
- Directional/dynamic RPE: keypoint anchoring (KP-RPE), affine-invariant methods (Kim et al., 21 Mar 2024)
- iRPE image-specific contextual RPE with query-interaction (Wu et al., 2021)
- Spatio-temporal extensions for video, capturing frame and spatial relations jointly (Hao et al., 3 Jul 2024)
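A minimal sketch of the bucketed 2D bias construction, in the style popularized by Swin/iRPE-type models; the helper name and return convention are assumptions, not a published implementation.

```python
import torch

def build_2d_rel_bias(h, w, num_heads):
    """Learnable 2D relative-position bias for an h*w token grid: one scalar per head
    and per (dy, dx) offset bucket, gathered into a (num_heads, h*w, h*w) bias."""
    table = torch.nn.Parameter(torch.zeros((2 * h - 1) * (2 * w - 1), num_heads))
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    coords = torch.stack([ys.flatten(), xs.flatten()])                # (2, h*w)
    rel = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0)  # (hw, hw, 2)
    rel[..., 0] += h - 1                                              # shift dy into [0, 2h-2]
    rel[..., 1] += w - 1                                              # shift dx into [0, 2w-2]
    index = rel[..., 0] * (2 * w - 1) + rel[..., 1]                   # flatten to bucket id
    bias = table[index].permute(2, 0, 1)                              # (num_heads, hw, hw)
    return table, bias  # bias is added to attention logits pre-softmax
```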
Graphs:
- RPE as pairwise maps encoding graph distances, resistance, diffusion, or spectral kernels; biasing attention with these features (Black et al., 22 Feb 2024); see the shortest-path sketch after this list
- Theoretical equivalence of APE and RPE in distinguishing graph topologies, with recommendations for diagonal and combinatorial awareness
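As one concrete instance of a graph-metric RPE, the sketch below biases attention by clipped shortest-path distance (a Graphormer-style construction, not the specific encodings analyzed by Black et al.); the Floyd-Warshall step is purely illustrative for small graphs.

```python
import torch

def spd_attention_bias(adj, num_heads, max_dist=5):
    """Shortest-path-distance RPE: one learnable bias per head per clipped SPD value.
    adj: (n, n) 0/1 adjacency matrix of an undirected graph."""
    n = adj.shape[0]
    dist = torch.full((n, n), float('inf'))
    dist[adj.bool()] = 1.0
    dist.fill_diagonal_(0.0)
    for k in range(n):                                 # Floyd-Warshall; fine for small n
        dist = torch.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    idx = dist.clamp(max=max_dist).long()              # unreachable pairs share the last bucket
    table = torch.nn.Parameter(torch.zeros(num_heads, max_dist + 1))
    return table[:, idx]                               # (num_heads, n, n) additive bias
```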
3D, Trajectory, and Geometric Data:
- 3D vertex-based RPE for point clouds (3DV-RPE): encoding offsets to bounding-box vertices, enforcing locality (Shen et al., 2023)
- Rotary variants for non-Euclidean domains (Bloch sphere, angular periodicity): 3D-RPE (adds chunked rotary encoding) (Ma et al., 14 Jun 2024), DRoPE (rotary with angular periodicity for heading) (Zhao et al., 19 Mar 2025)
Spiking and Non-Standard Architectures:
- Hamming-invariant, spike-native RPEs such as Gray-PE and Log-PE compatible with XNOR-spiking attention (Lv et al., 28 Jan 2025)
Multi-Scale and Extrapolative RPE:
- Wavelet-based RPE: multi-scale parameterization, generalizing RoPE as a fixed-scale Haar transform to analytic, continuous wavelets for extrapolation (Oka et al., 4 Feb 2025)
3. Practical Implementation Mechanisms
Attention Bias Table Construction:
- For 1D or 2D spatial RPE, a small bias table (e.g. with $(2H-1)(2W-1)$ entries per head for an $H \times W$ grid) is learned and indexed via piecewise bucketing, axis quantization, keypoint distance, or direct offset computation (Wu et al., 2021, Kim et al., 21 Mar 2024); a log-spaced bucketing sketch follows this list.
- For combinatorial or continuous relations, analytic or spectral calculation replaces explicit tables (Choromanski et al., 2023, Oka et al., 4 Feb 2025).
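The sketch below illustrates piecewise (log-spaced) bucketing in the style of the T5 relative bias; the bucket counts and the bidirectional assumption are illustrative defaults, not the exact published scheme.

```python
import torch

def relative_position_bucket(rel_pos, num_buckets=32, max_distance=128):
    """Map signed offsets (j - i) to bucket ids: exact buckets for small offsets,
    logarithmically spaced buckets up to max_distance for large ones (T5-style)."""
    num_buckets //= 2                                   # half the buckets for each sign
    out = (rel_pos > 0).long() * num_buckets
    rel = rel_pos.abs()
    max_exact = num_buckets // 2
    # log-spaced bucket for large offsets (clamped away from zero to keep log finite)
    large = max_exact + (
        torch.log(rel.float().clamp(min=1) / max_exact)
        / torch.log(torch.tensor(max_distance / max_exact))
        * (num_buckets - max_exact)
    ).long()
    large = torch.minimum(large, torch.tensor(num_buckets - 1))
    return out + torch.where(rel < max_exact, rel, large)
```

A learned table of shape (total buckets, heads) is then indexed by these ids and the result is added to the attention logits.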
Integration into Self-Attention:
- Additive bias: introduce the learned bias term $b_{ij}$ into the attention logits pre-softmax.
- Contextual bias: modulate the bias via interaction with the query or key, e.g. $b_{ij} = q_i^\top r_{ij}$, for richer modeling (Wu et al., 2021); a minimal sketch follows this list.
- Value augmentation: shift values or outputs by RPE as well as logits.
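A minimal sketch of the contextual (query-conditioned) variant, in which the bias is the dot product between each query and a bucketed relative embedding; names and shapes are illustrative.

```python
import torch

def contextual_rel_bias(q, rel_table, bucket_idx):
    """Query-conditioned relative bias b_ij = q_i . r_{bucket(i, j)}.
    q: (n, d) queries; rel_table: (num_buckets, d); bucket_idx: (n, n) long ids."""
    r = rel_table[bucket_idx]                      # (n, n, d) gathered relative embeddings
    return torch.einsum('id,ijd->ij', q, r)        # add to attention logits pre-softmax
```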
Scaling and Efficiency:
- Toeplitz or convolutional structure is exploited for $O(n \log n)$ or linear complexity using FFT/convolution (Luo et al., 2021, Choromanski et al., 2023).
- In spatio-temporal or video settings, parameter grouping and window-based inference reduce parameter and compute cost (Hao et al., 3 Jul 2024).
- Rotary approaches (RoPE, DRoPE, 3D-RPE) encode relative information implicitly via blockwise vector rotation, incurring minimal memory overhead (Zhao et al., 19 Mar 2025, Ma et al., 14 Jun 2024).
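For the rotary family, the following single-head sketch shows plain RoPE (not the DRoPE or 3D-RPE variants), using the half-split channel pairing: each channel pair is rotated by an angle proportional to the token index, so inner products of rotated queries and keys depend only on the offset.

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate each 2-channel pair of x (shape (n, d), d even) by position-dependent angles."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # per-pair frequencies
    angles = torch.arange(n, dtype=torch.float32)[:, None] * freqs[None, :]  # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # standard 2D rotation applied channel-pair-wise
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# (apply_rope(q) @ apply_rope(k).T)[i, j] depends only on the relative offset i - j
```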
4. Empirical Impact and Domain-Specific Insights
Performance Gains:
- Machine translation: RPE yields +0.3 to +1.3 BLEU over absolute encodings, with ablations confirming the benefit of relative information for generalization to longer contexts (Shaw et al., 2018).
- Vision: 2D RPE improves DeiT/ViT by +1.5% top-1 (ImageNet) and DETR by +1.3 mAP (COCO); KP-RPE outperforms static RPEs on alignment-perturbed face datasets by >15% in TAR@1e-4 (Wu et al., 2021, Kim et al., 21 Mar 2024).
- 3D point clouds: 3DV-RPE boosts AP25/AP50 by +12–19 points (ScanNetV2), outperforming box-mask or center-based RPEs (Shen et al., 2023).
- Video/spatio-temporal: parameterized RPE in MLP blocks achieves competitive accuracy with dramatic parameter/FLOP reductions (Hao et al., 3 Jul 2024).
- Spiking models: Gray-PE and Log-PE improve time-series performance by >1% and text-classification accuracy by +3%; a 2D extension lifts image-classification accuracy by up to 0.4% (Lv et al., 28 Jan 2025).
- Extrapolative regimes: wavelet-based RPE and spectral/learned-Fourier approaches maintain or improve perplexity and classification accuracy at sequence lengths far beyond training, outperforming RoPE interpolations (Oka et al., 4 Feb 2025, Choromanski et al., 2023).
Theoretical Contributions:
- Graphs: RPE and APE are equivalent in distinguishing power for graph transformers, but RPE can operationalize arbitrarily complex, domain-aware node/edge relations, including those that match or exceed the Weisfeiler–Leman hierarchy (Black et al., 22 Feb 2024).
- Universality: Standard RPE-based Transformers are not universal approximators if the RPE bias enters only through the softmax attention, but extensions (e.g., the gating used in URPE) can restore universal function approximation (Luo et al., 2022).
- Rotary encodings: DRoPE (directional RoPE) is necessary for angular periodicity—critical in agent/trajectory modeling—whereas standard RoPE fails to mod out the $2\pi$ periodicity of heading angles (Zhao et al., 19 Mar 2025); 3D-RPE mitigates long-context decay and interpolation resolution loss (Ma et al., 14 Jun 2024).
5. Architectural Variations and Recent Extensions
| Class of RPE | Core Idea | Example Contexts |
|---|---|---|
| Bucketing/Bias Table | Learn bias per discrete relative offset | Shaw et al. RPE, iRPE, T5 |
| Contextual/Query-aware | Make bias table query-dependent | iRPE (contextual mode), KP-RPE |
| Fourier/Spectral | Learn RPE as spectral density (random features) | FLT, kernelized RPE, stochastic PE |
| Rotary (RoPE/3D-RPE/DRoPE) | Encode offset by coordinate-wise rotation | RoPE/DRoPE/3D-RPE, LLMs, agents |
| Wavelet-based | Multi-scale RPE via analytic wavelets | Long-context LMs (Oka et al., 4 Feb 2025) |
| Graph metric-based | SPD/RD/Kernel as RPE, diffusion on graphs | Graph Transformers (Black et al., 22 Feb 2024) |
| Keypoint/dynamic | Condition bias on detected/estimated keypoints | Face recognition KP-RPE (Kim et al., 21 Mar 2024) |
| Spike-compatible | Hamming-invariant, XNOR-attn (Gray-PE, Log-PE) | Spiking Transformers (Lv et al., 28 Jan 2025) |
Recent lines of research focus on extrapolation and efficiency:
- 3D-RPE encodes offset in a chunked Bloch-sphere style, dramatically improving long-context generalization and effective position resolution under interpolation (Ma et al., 14 Jun 2024).
- CARoPE introduces data- and head-dependent rotary frequencies, breaking RoPE rigidity and improving both throughput and model scaling at long context (Veisi et al., 30 Jul 2025).
6. Design Recommendations and Limitations
Guidelines:
- For vision: Use 2D bucketed, contextual RPE rather than naive 1D.
- For video: Employ axis-specific parameterized RPE and exploit channel grouping for efficient diversity (Hao et al., 3 Jul 2024).
- For long-range text or code: Multi-scale (wavelet or Fourier) RPEs avoid catastrophic degradation with extrapolation.
- For graph learning: Choose RPEs aligned with desired isomorphism power; for strong local/adjacency bias, include combinatorial edge features (Black et al., 22 Feb 2024).
- For high efficiency: Rotary encodings (RoPE, DRoPE, 3D-RPE) provide near-zero memory overhead, a spectrum of inductive biases, and good scalability.
- In spiking regimes: XNOR/Hamming-aware encodings preserve event-based computation, but may require further extension for very long horizons (Lv et al., 28 Jan 2025).
Caveats:
- Table-lookup RPEs still incur $O(n^2)$ memory for the pairwise bias unless symmetry/Toeplitz structure is exploited.
- Contextual/dynamic RPEs (e.g. KP-RPE) add negligible compute, but require reliable auxiliary cues (keypoints).
- Standard RPE (bias applied only inside the softmax) cannot represent functions that must distinguish positions of otherwise identical (row-equal) inputs (Luo et al., 2022).
- In graph transformers, degree of diagonal/combinatorial awareness may limit expressive power.
- In extended contexts, only RPEs with explicit multi-scale or multi-phase mechanisms can reliably extrapolate (Oka et al., 4 Feb 2025, Ma et al., 14 Jun 2024).
7. Future Directions and Open Problems
Several axes remain active:
- Universal, scalable, multi-modal RPEs unifying spatial, temporal, and logical relationships efficiently.
- Adaptive RPEs sensitive to local structure, data, or downstream objectives (as in CARoPE).
- Theoretical characterizations—beyond expressivity and universality—of learnable RPE efficiency for generalization and extrapolation.
- Efficient hardware mappings for kernelized, Toeplitz, and spiking RPEs.
- Domain-specific RPE in areas like protein folding, computational chemistry, and social network analysis, leveraging new forms of relation-aware self-attention.
Relative Position Representations remain a central, rapidly evolving ingredient in the architecture of modern attention-based deep learning systems, unifying classical notions of order, locality, and structure within flexible, scalable, and increasingly expressive neural architectures.